Abstract
Patients recovering from stroke often struggle with their rehabilitation exercises at home, as personal therapists are both costly and potentially unavailable. Virtual rehabilitation programs may assist these patients by providing quantitative and qualitative feedback. A critical step for such automated systems is identifying the performed activities and segmenting them into repetitions, making repetition counting an essential component. For this, an affordable, robust, and generalized system is required. Previous works rely on expensive systems like Vicon and Kinect (discontinued), or require extensive training and are incapable of live counting. In this study, we propose a live repetition counter that works on RGB videos, using MediaPipe for joint extraction. We also perform activity recognition over 12 activities using a lightweight transformer-based model. We employ a four-axis variance mechanism to monitor motion along any arbitrary axis, combined with an autocorrelation method to identify repetitions across arbitrary motion patterns. Our system was evaluated on a custom RGB dataset as well as the benchmark datasets UI-PRMD and KIMORE, achieving mean MAE values of 3.2733 and 1.4467, respectively, along with real-time validation through preliminary experiments. For activity recognition, the model achieved F1-scores of up to 0.996. Relying solely on an RGB camera, our approach ensures practicality for home rehabilitation.
Introduction
Physical exercise offers numerous advantages and hence plays a vital role in improving overall well-being. Its positive impact on both mental and physical health1 has been highlighted in a wide range of studies. Despite these benefits, the majority of the global population struggles to maintain a consistent exercise regimen2. Beyond general fitness, physical exercise is the cornerstone of rehabilitation after illness, trauma, or stroke. Patients initially undergo in-clinic rehabilitation under the supervision of doctors and health professionals. Following in-clinic rehab, patients are often required to undergo home rehabilitation. Home-based rehabilitation is crucial for the patient to regain mobility, improve joint function, and manage pain more flexibly and efficiently3,4,5. Its importance has been well documented, especially in the context of recovery after orthopaedic procedures such as knee and hip surgeries6,7,8,9. Despite its effectiveness, access to rehabilitation services remains limited for many due to financial constraints, lack of available professionals, and personal barriers such as motivation or confidence10. These challenges highlight the urgent need for affordable, accessible, and user-friendly home-based rehabilitation solutions.
In countries like Germany, patients are offered aftercare and rehabilitation programs such as the multimodal intensified program (IRENA) or training rehabilitation aftercare (T-RENA), but only half of them actually take part11. An automated virtual assistant can guide and support patients through rehabilitation exercises, track their progress, and provide valuable data to healthcare providers for adjusting treatment plans. Moffet et al.12 and Tousignant et al.13 demonstrated telerehabilitation applications in which patients communicated in real-time via video conferencing. Other studies have shown the use of sensors like Kinect to provide real-time feedback by analyzing the patients’ motion and activities14,15, also referred to as Human Activity Recognition (HAR). HAR is also a part of physical fitness training and rehabilitation16,17. Central to these programs are regular and repetitive exercises that allow patients to regain their natural mobility and strength18. With advancements in technologies like Artificial Intelligence (AI), AI-driven systems offer a promising approach to virtual assistance for rehabilitation at home19. These systems utilize various sensors to track body motion, and various algorithms and methods are employed to analyze the patient’s motion and give quantitative and qualitative feedback to the users20,21. In rehabilitation, patients are often prescribed particular exercises with a defined number of repetitions and sets, performed in a certain manner describing the quality of motion. The evaluation criteria involve adherence to the prescribed repetitions/sets22 with correct motion execution, consistency in performing these exercises, and maintaining proper body postures. Consequently, human activity recognition (HAR), activity repetition counting (ARC), and human motion quality assessment (HQA) have become essential components of virtual assistance systems.
We have addressed the third component, human motion quality assessment (HQA), in our previous work23 using an enhanced spatio-temporal transformer framework to obtain quality score values. This work focuses only on the first two components HAR and ARC.
In this study, we propose a low-cost RGB camera-based activity repetition counting system using autocorrelation and peak detection for use in home rehabilitation. The system is a live counter: it performs repetition counting on pre-recorded videos without requiring future frames, simulating real-time operation by processing frames sequentially and producing immediate repetition counts. Our system can also work in real-time using webcam input, as verified in the preliminary experiments. This is beneficial, as users/patients can identify the exact frames or times at which they made an error while performing the rehabilitation exercises, either in the video or in real-time. Additionally, we trained a transformer encoder-based deep learning model for activity recognition. We employed the MediaPipe24 open-source framework for extracting joint coordinates, which were subsequently used for model training and repetition counting. The proposed system does not require prior training for repetition counting. The key contributions of the study are as follows:
-
We propose an affordable RGB camera-based live activity repetition counting and recognition system that uses autocorrelation and peak detection techniques for repetition counting and a transformer encoder-based deep learning model for activity recognition. The proposed system does not require prior training for activity repetition counting.
-
We provide a dataset of eight healthy individuals repeating 11 different activities (involving rehabilitation exercises) captured using a smartphone.
-
We introduce a novel combined vector method to obtain an overall repetitive motion in the spatial domain.
-
Additionally, we utilized a four-axis variance-based directional monitoring system to count repetitions along any arbitrary direction.
-
We evaluated our proposed approach on our custom RGB dataset. To establish generalizability and robustness, we also evaluated our system on the publicly available datasets UI-PRMD25 and KIMORE26 and compared the results with other state-of-the-art methods.
-
We also conducted a preliminary study to evaluate our model using live camera input. The model demonstrated good performance in real-time when tested with webcam feeds.
This article is organized as follows: Section “Related works” provides an overview of the related literature. Section “Proposed method” describes the data collection process, joint coordinates extraction, window creation, deep learning model architecture and activity repetition counting method. Section “Experiments” describes the collected dataset, evaluation metrics and implementation details. Section “Results” includes the proposed system performance comparisons. Section “Real time evaluation” discusses the performance evaluation in real-time. Section “Discussion” analyzes the performance of the proposed approach. Section “Limitations” identifies the limitations and section “Future scope” outlines the future scope. Finally, section “Conclusion” concludes the study.
Fig. 1 Overview of the proposed system.
Fig. 2 Output of MediaPipe pose estimation and the frame (window) selection process.
Related works
HAR and ARC involve vision-based and inertial sensor-based systems27,28. Numerous techniques have been employed for HAR and ARC. For instance, Soro et al. used inertial sensors to track motion with a CNN model for activity recognition and repetition counting, achieving an accuracy of 91% in activity counting29. On the other hand, vision-based systems have also been used for repetition counting via recorded videos or live streams30,31. Apart from deep learning, techniques like autocorrelation and the wavelet transform have been used to extract periodicity and count repetitions32,33,34. Zhang et al. demonstrated a technique to identify repetitions in videos captured across various domains, proposing a coarse-to-fine cycle refinement approach35. Dwibedi et al. proposed a machine learning-based system named ‘RepNet’, in which the period prediction module is confined to use temporal self-similarity as an intermediate representation bottleneck, enabling generalization to unseen repetitions in the wild36. However, their system has to analyse the entire video sequence before producing repetition counts. In addition, they select the input frame rate by picking the maximum periodicity classification score; this rate selection scheme is not optimal for error-free counting, as the resulting high frame rate selection leads to omissions. Levy and Wolf presented a live repetition counter that runs online rather than on the entire captured video. By sequentially analysing blocks of 20 non-consecutive frames, the system estimates the cycle length using a CNN. Entropy was used to systematically start and stop the repetition counter. They evaluated their system on real-world and synthetic videos37. Despite being a live repetition counter, their system cannot handle dynamic variations in the period cycle. Li et al. adopted an LSTM network on image sequences to learn temporal dependencies, aiming to recover respiratory and cardiac motion via deep learning models without full annotation or an explicit motion model hypothesis38. However, their model cannot handle complex real-world repetitions. In 2024, Sinha et al. proposed an exemplar-based video repetition counting method named ESCounts. They employed an attention-based encoder-decoder to learn, during training, visual correspondences between clips containing a single repetition and the full video. The learned latent representation is later deployed for repetition counting. Despite giving state-of-the-art performance across diverse datasets like RepCount, Countix and UCFRep, their approach relies heavily on pseudo-labels for data without precise repetition annotations, limiting localization accuracy. In addition, the model’s performance degrades for high repetition counts or long-duration videos, and the model has to analyse the whole video sequence before giving the output39.
Motivated by the above-mentioned limitations, we propose an autocorrelation-based live repetition counting system with action recognition using a transformer encoder-based deep learning model. Our system works on RGB videos. We employed the MediaPipe open-source framework for joint coordinate extraction from the video frames. We trained the deep learning model to identify 12 different activities. However, the repetition counter requires no training and remains fully functional without the deep learning model. We utilized a four-axis variance-based method to monitor the motion along any arbitrary axis in the spatial domain, making the system robust and generalized toward a wide variety of activities. Furthermore, the autocorrelation technique supports the identification of repetitions for arbitrary motion patterns. By combining a lightweight transformer-based model with an efficient repetition counting algorithm, the proposed system is well-suited for home-based rehabilitation applications.
Proposed method
For activity recognition, we employed a transformer encoder-based deep learning model. An RGB camera captures the video frames, and joint coordinates are extracted using the MediaPipe open-source framework. The extracted joint coordinates are further processed to identify the performed activity. Additionally, the captured coordinates are analyzed using autocorrelation and peak detection techniques to count the number of repetitions. Figure 1 presents an overview of the proposed framework. A detailed description of the proposed method is presented in the later sections.
Data collection
We enrolled multiple healthy participants in our study to perform different exercises. The demographics of the participants are shown in Table 1. All participants signed informed consent. The study was approved by the local Institute Ethical Committee (Humans) at Indian Institute of Technology Ropar file no. IITRPR/IEC/2023/025 dated \(1^{st}\) March, 2024 and all experiments were performed in accordance with relevant guidelines and regulations. Following the protocol, all participants were asked to perform different exercises to execute a certain number of repetition cycles. The protocol included 12 activities including rehabilitation exercises40. Table 2 presents details of the performed activities. Among these, 11 activities (excluding idle) were used for repetition counting, while all 12 were used for training the transformer encoder-based deep learning model.
All performed activities were recorded using a OnePlus 11R Android smartphone with a 50 MP rear camera. Videos were captured at a resolution of 720p and a frame rate of 30 FPS. The smartphone was mounted on a tripod at a distance of 2–3 meters from the participants. In total, 96 videos were recorded, comprising 1480 repetitions.
Fig. 3 Transformer model architecture for activity prediction.
Key points detection using mediapipe
The MediaPipe framework is used for predicting the body skeleton in the current study due to its reliability and good accuracy at low latency. Garg et al.41 compared different deep learning models for Yoga pose estimation and concluded that MediaPipe markedly enhances system performance and is quite effective for pose estimation.
MediaPipe identifies 33 key landmarks on the human body. Each landmark is represented by the vector \([x_i, y_i, z_i, v_i]\), containing three spatial components and one visibility factor. The visibility factor’s magnitude lies between 0 and 1, with values near zero corresponding to low visibility (e.g., occlusion or poor lighting) and 1 to high visibility. After combining all landmarks, we get a feature vector of length D = 132 for each frame captured at a particular time instance: \(\textbf{a} = [x_0(f), y_0(f), z_0(f), v_0(f), \dots , x_{32}(f), y_{32}(f), z_{32}(f), v_{32}(f)] \in \mathbb {R}^{1 \times D }\), where the subscripts represent landmarks and f is the frame index or time-point. The output of the MediaPipe Pose Landmark model is shown in Fig. 2.
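As an illustration, the per-frame 132-dimensional feature vector can be assembled as follows (a minimal sketch; the dummy landmark values stand in for MediaPipe Pose output, which would normally be read from `result.pose_landmarks.landmark`):

```python
import numpy as np

NUM_LANDMARKS = 33        # MediaPipe Pose landmark count
D = NUM_LANDMARKS * 4     # 132 features per frame

def frame_feature_vector(landmarks):
    """Flatten 33 (x, y, z, visibility) landmark tuples into the
    1 x 132 per-frame feature vector described above."""
    a = np.asarray(landmarks, dtype=float)  # shape (33, 4)
    return a.reshape(-1)                    # shape (132,)

# Dummy values stand in for the output of MediaPipe's Pose solution.
dummy = [(0.5, 0.5, 0.0, 1.0)] * NUM_LANDMARKS
vec = frame_feature_vector(dummy)           # shape (132,)
```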
Windows creation (data preparation)
We utilized a sliding window mechanism to divide the entire sequence of frames into smaller segments (see Fig. 2). For the input frame sequence represented by a 2D matrix \(\textbf{A}= [\textbf{a}_1, \textbf{a}_2, \dots , \textbf{a}_F]^{'} \in \mathbb {R}^{F \times D}\), F is the number of frames in the entire sequence, and each \(\textbf{a}_f\) represents the joint-coordinate feature vector of a single frame with length D. The sliced data matrix can be obtained as:
$$\begin{aligned} \textbf{I}_m = [\textbf{a}_{mu+1}, \textbf{a}_{mu+2}, \dots , \textbf{a}_{mu+N}]^{'} \in \mathbb {R}^{N \times D} \end{aligned}$$
where \(I_m\) is the \(m^{th}\) window matrix, u is the stride, and N is the window size.
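The window creation step can be sketched as follows (a minimal illustration; the window size and stride values are examples):

```python
import numpy as np

def make_windows(A, N=16, u=1):
    """Slice the F x D sequence matrix A into windows I_m of shape
    N x D using a sliding window with stride u."""
    F = A.shape[0]
    n_windows = (F - N) // u + 1
    return np.stack([A[m * u : m * u + N] for m in range(n_windows)])

A = np.random.rand(100, 132)          # 100 frames, D = 132 features
windows = make_windows(A, N=16, u=1)  # shape (85, 16, 132)
```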
Deep learning model
In this study, we utilized a transformer encoder-based architecture for activity identification. Figure 3 shows the architecture of the deep learning model. The architecture mainly consists of two parts: a transformer encoder followed by dense layers. The transformer encoder consists of multi-head self-attention and feed-forward layers. The window matrix \(I_m\) is given as input to the transformer encoder.
The transformer model’s multi-head self-attention is an advanced mechanism intended to capture complex dependencies within a sequence42. Employing several parallel attention heads improves on the conventional self-attention process. Each head uses its own set of query (Q), key (K), and value (V) matrices to carry out self-attention independently. The self-attention mechanism uses the scaled dot-product formula to calculate the attention scores for a single attention head:
$$\begin{aligned} \text {Attention}(Q, K, V) = \text {softmax}\left( \frac{QK^{T}}{\sqrt{d_{model}}}\right) V \end{aligned}$$
where \(d_{model}\) is the keys’ dimensionality. By producing a probability distribution over the inputs, the softmax function ensures that the attention scores are normalized. In multi-head self-attention, learned projection matrices are used to linearly project the input embeddings into H distinct subspaces (heads): \(Q_h = X W^Q_h\), \(K_h = X W^K_h\), and \(V_h = X W^V_h\), where the projection matrices for the h-th head are \(W^Q_h\), \(W^K_h\), and \(W^V_h\). Each head h then computes self-attention as:
$$\begin{aligned} \text {head}_h = \text {Attention}(Q_h, K_h, V_h) \end{aligned}$$
All heads’ outputs are concatenated and passed through a final linear projection:
$$\begin{aligned} \text {MultiHead}(X) = \text {Concat}(\text {head}_1, \dots , \text {head}_H)\, W^O \end{aligned}$$
where \(W^O\) denotes the output projection matrix. The encoder layer incorporates a position-wise feed-forward neural network (FFN) after the multi-head self-attention. By adding non-linearity, this feed-forward network enables the model to further transform the representations. Both sub-layers (multi-head self-attention and the feed-forward network) are followed by a residual connection and layer normalization.
The output of the transformer encoder is passed to the dense layers, with the last layer followed by a softmax layer that outputs the activity predictions.
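The multi-head self-attention described above can be sketched in plain NumPy (a minimal illustration with assumed dimensions; the residual connections, layer normalization, feed-forward sub-layer, and dense classification head of the full model are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, H):
    """Scaled dot-product self-attention with H heads.
    X: (T, d_model); Wq/Wk/Wv: per-head projection matrices; Wo: output proj."""
    heads = []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))  # attention weights, rows sum to 1
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
T, d_model, H = 16, 64, 4                    # assumed window/model sizes
d_h = d_model // H
X = rng.standard_normal((T, d_model))
Wq = [rng.standard_normal((d_model, d_h)) for _ in range(H)]
Wk = [rng.standard_normal((d_model, d_h)) for _ in range(H)]
Wv = [rng.standard_normal((d_model, d_h)) for _ in range(H)]
Wo = rng.standard_normal((d_model, d_model))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, H)  # shape (16, 64)
```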
Activity repetition counting
For any motion to be considered periodic, a certain pattern must repeat itself after a particular interval of time. Each activity is unique and involves the movement of different body parts, which makes the repetitive counting tasks challenging.
Fig. 4 Body joint landmarks and the combined vector V with varying magnitudes for a participant performing an inline lunge activity. Here, \(f_0, f_1 \dots f_9\) are the frames at different time stamps.
Combined motion
To generalize the motion, we combine the motion of all joints by summing their velocity vectors \(\vec {v'_m}\) for each frame (see Fig. 4). For our study, only the x and y coordinates are included in the velocity vector calculation. For the matrix \(A'\) containing landmarks \(a'= [x_0, y_0, x_1, y_1, x_2, y_2, \dots , x_{32}, y_{32}]\) with x and y coordinates, the velocity vector can be written as:
$$\begin{aligned} \vec {v'}_m = a'_m - a'_{m-1} \end{aligned}$$
Here, \(\vec {v'_m}\) is the velocity vector for frame m, \(a'_m\) is the landmark vector obtained for frame m, and \(a'_{m-1}\) is the landmark vector for frame \(m-1\).
Data filtering
To compensate for noise and disturbances, which can cause the magnitude of the vector \(\vec {v'_m}\) to spike for an instant, the x and y components \(v'^x \text { and } v'^y\) are capped to a threshold for both negative and positive values:
$$\begin{aligned} v'^x = \max \left( -\delta _{\text {limit}}, \min \left( v'^x, \delta _{\text {limit}}\right) \right) , \quad v'^y = \max \left( -\delta _{\text {limit}}, \min \left( v'^y, \delta _{\text {limit}}\right) \right) \end{aligned}$$
Here, \(\delta _{\text {limit}}\) is the capping threshold value. This capping ensures smooth operation. The combined vector V is the vector sum of the velocity vectors of all body landmarks. The combined vector for the \(m^{th}\) frame is given as:
$$\begin{aligned} \vec {V}_m = \sum _{j=0}^{32} \vec {v'}_m(j) \end{aligned}$$
Here, \(\vec {v'}_m(j)\) is the velocity vector of the \(j^{th}\) joint landmark for the \(m^{th}\) frame.
The capped values are further smoothed using an exponential average:
$$\begin{aligned} V^x_m = \alpha V'^x_m + (1 - \alpha ) V^x_{m-1}, \quad V^y_m = \alpha V'^y_m + (1 - \alpha ) V^y_{m-1} \end{aligned}$$
where \(V^x_m\) and \(V^y_m\) are the smoothed values of the x and y components of the combined vector \(\vec {V}_m\), respectively, at the \(m^{th}\) frame, \(V'^x_m\) and \(V'^y_m\) are the raw input values of the x and y components, and \(V^x_{m-1}\) and \(V^y_{m-1}\) are the previous smoothed values. The smoothing factor \(\alpha \in (0,1)\) determines the contribution of current versus historical values; higher values of \(\alpha\) emphasize recent data. The \(\alpha\) value is adjusted dynamically based on the activity identified by the deep learning model, i.e., higher values for fast activities (run and jump) and lower values for slow activities.
This makes the system generalized for any kind of motion. The magnitude of this resultant vector \(\Vert \textbf{V}\Vert = \sqrt{(V^x)^2 + (V^y)^2}\) is later utilized for repetition counting using correlation.
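The velocity capping, joint-wise summation, and exponential smoothing steps above can be sketched as follows (a minimal illustration; the \(\delta _{\text {limit}}\) and \(\alpha\) values are assumed, and only the x and y coordinates are used, as in the text):

```python
import numpy as np

DELTA_LIMIT = 0.05  # assumed capping threshold (illustrative)
ALPHA = 0.3         # assumed smoothing factor (illustrative)

def combined_vector(prev_lm, cur_lm, prev_V, delta=DELTA_LIMIT, alpha=ALPHA):
    """prev_lm, cur_lm: (33, 2) arrays of x, y landmark coordinates for
    two consecutive frames. Returns the smoothed combined vector V."""
    v = cur_lm - prev_lm                         # per-joint velocity vectors
    v = np.clip(v, -delta, delta)                # cap instantaneous spikes
    V_raw = v.sum(axis=0)                        # vector sum over all joints
    return alpha * V_raw + (1 - alpha) * prev_V  # exponential smoothing

prev = np.zeros((33, 2))
cur = np.full((33, 2), 0.01)
V = combined_vector(prev, cur, prev_V=np.zeros(2))
# each component: 0.3 * (33 * 0.01) = 0.099
```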
Fig. 5 The four axes: x, y, 45pos and 45neg. The combined vector \(V\) is shown in red. Here, x_proj, y_proj, 45pos_proj, and 45neg_proj are the projections along the x-axis, y-axis, 45pos-axis, and 45neg-axis respectively.
Fig. 6 (a) Signed magnitude plot of the vector Z for the sit-to-stand activity. (b) Autocorrelation R plot of the vector Z. (c) Detected peaks in the autocorrelation vector R, indicating periodic repetitions in the motion.
Motion direction
The repetitive motion can be along any direction. For example, for squats, the motion is along the vertical axis, whereas for inline lunges, the dominating motion is along a diagonal axis. To capture the directional information, we take projections along four axes: the x-axis, y-axis, 45pos-axis and 45neg-axis (see Fig. 5). Here, the 45pos axis is the axis at \(+45^{\circ }\) to the x-axis and the 45neg axis is at \(-45^{\circ }\) to the x-axis. The unit vectors along the 45pos and 45neg axes are given as:
$$\begin{aligned} \hat{a}_{45pos} = \frac{1}{\sqrt{2}}\left( \hat{x} + \hat{y}\right) , \quad \hat{a}_{45neg} = \frac{1}{\sqrt{2}}\left( \hat{x} - \hat{y}\right) \end{aligned}$$
The projection is calculated for the velocity vector \(\vec {V}_m\) over a certain number of frames M. To find the major axis along which the motion occurs, we take the variance of the projections over M frames along the x-axis, y-axis, 45pos-axis and 45neg-axis. For the list of vectors \(U = [\vec {V}_0, \vec {V}_1, \vec {V}_2, \dots , \vec {V}_{M-1}]\), the variance can be calculated as:
$$\begin{aligned} \text {Var}(U) = \frac{1}{M} \sum _{m=0}^{M-1} \left( \vec {V}_m \cdot \hat{a}_K - \mu \right) ^2 \end{aligned}$$
where \(\mu = \frac{1}{M} \sum _{m=0}^{M-1} \left( \vec {V}_m \cdot \hat{a}_K \right)\) is the mean of the projections over \(M\) frames, \(\vec {V}_m \cdot \hat{a}_K\) is the projection of the velocity vector \(\vec {V}_m\) onto a candidate axis \(\hat{a}_K\), and \(\text {Var}(U)\) is the variance of the projected values. The variance is computed for each of the four candidate axes, and the axis with the maximum variance is considered the major axis. For the major axis K, the immediate projection value for the \(m^{th}\) frame is \(\vec {V}_m \cdot \hat{a}_K\), where \(\hat{a}_K\) is the unit vector along the major axis \(K \in [x, y, 45pos, 45neg]\).
Now, the projection \(\vec {V}_m \cdot \hat{a}_K\) can be negative or positive. For a positive projection value, the magnitude \(\Vert \textbf{V}\Vert\) is multiplied by +1, and for a negative projection value, it is multiplied by -1:
$$\begin{aligned} Z_m = \text {sign}\left( \vec {V}_m \cdot \hat{a}_K \right) \Vert \vec {V}_m \Vert \end{aligned}$$
where \(Z_m\) is the signed magnitude (see Fig. 6) of the combined vector \(\vec {V}_m\) for the \(m^{th}\) frame and \(\hat{a}_K\) is the unit direction along the major axis.
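The major-axis selection and signed-magnitude computation above can be sketched as follows (a minimal illustration; the synthetic vertical oscillation is only a test signal):

```python
import numpy as np

S = 1 / np.sqrt(2)
AXES = {                       # unit vectors of the four candidate axes
    "x": np.array([1.0, 0.0]),
    "y": np.array([0.0, 1.0]),
    "45pos": np.array([S, S]),
    "45neg": np.array([S, -S]),
}

def signed_magnitudes(U):
    """U: (M, 2) combined vectors over the last M frames. Selects the
    axis with maximum projection variance as the major axis and returns
    the signed magnitudes Z_m together with the chosen axis name."""
    variances = {name: np.var(U @ a) for name, a in AXES.items()}
    major = max(variances, key=variances.get)
    proj = U @ AXES[major]
    return np.sign(proj) * np.linalg.norm(U, axis=1), major

# A purely vertical oscillation: the major axis should come out as "y".
t = np.linspace(0, 4 * np.pi, 60)
U = np.stack([np.zeros_like(t), np.sin(t)], axis=1)
Z, major = signed_magnitudes(U)
```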
Autocorrelation
To identify repetitions, we utilized autocorrelation using the Pearson correlation on the 1D vector \(Z = [Z_0, Z_1, Z_2, \dots , Z_{Q-1}] \in \mathbb {R}^{1\times Q}\). The autocorrelation formula can be written as:
$$\begin{aligned} \tilde{R}(\ell ) = \frac{\sum _{t=0}^{Q-\ell -1} \left( Z_t - \bar{Z}\right) \left( Z_{t+\ell } - \bar{Z}\right) }{\sum _{t=0}^{Q-1} \left( Z_t - \bar{Z}\right) ^2}, \quad R(\ell ) = \frac{\tilde{R}(\ell )}{\max \limits _{\ell } \tilde{R}(\ell )} \end{aligned}$$
where \(Z_t\) is the value of the signal at time \(t\), and \(\bar{Z}\) is the mean of the signal over all time steps. The lag \(\ell\) represents the shift in time, ranging from \(0\) to \(Q - 1\), where \(Q\) is the total number of samples. The Pearson autocorrelation coefficient at lag \(\ell\), denoted as \(\tilde{R}(\ell )\), measures the linear correlation between the original signal and its version shifted by \(\ell\) steps. \(R(\ell )\) is the normalized autocorrelation, and \(\max \limits _{\ell } \tilde{R}(\ell )\) denotes the maximum value over all lags of the unnormalized autocorrelation. For this study, the lag increment was set to 1.
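The normalized autocorrelation can be sketched as follows (a minimal illustration; the sinusoidal test signal is assumed):

```python
import numpy as np

def normalized_autocorrelation(Z):
    """Mean-centered (Pearson-style) autocorrelation of the signed
    magnitude series Z, normalized by its maximum value (lag step 1)."""
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean()
    R_tilde = np.correlate(Zc, Zc, mode="full")[len(Z) - 1:]  # lags 0..Q-1
    return R_tilde / R_tilde.max()

# For a periodic signal, peaks in R appear at multiples of the period.
t = np.arange(200)
Z = np.sin(2 * np.pi * t / 40)  # period of 40 samples
R = normalized_autocorrelation(Z)
```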
Peak detection
Peak detection is performed to count the number of repetitions in the last M frames. For the 1D vector \(R=[R(0), R(1), R(2), \dots , R(Q-\ell -1)]\), peak detection is performed using the \(find\_peaks\) function of the scipy.signal library. For peak detection, the height threshold was set to 0.7, prominence to \(0.8\times \max (R)\), distance to 10, and width to 2 (see Fig. 6). The function returns the frame indices at which peaks were detected. For continuous peak detection and to limit the computational load, the size of the signed magnitude vector Z is limited by dropping the older values. The peak detection algorithm is explained in Algorithm 1.
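A sketch of the peak-detection step with the thresholds quoted above (the cosine stand-in for the autocorrelation signal is illustrative):

```python
import numpy as np
from scipy.signal import find_peaks

def detect_repetition_peaks(R):
    """Locate peaks in the normalized autocorrelation R using the
    thresholds quoted in the text."""
    R = np.asarray(R, dtype=float)
    peaks, _ = find_peaks(R, height=0.7, prominence=0.8 * R.max(),
                          distance=10, width=2)
    return peaks

t = np.arange(300)
R = np.cos(2 * np.pi * t / 50)      # stand-in for an autocorrelation signal
peaks = detect_repetition_peaks(R)  # interior maxima at t = 50, 100, ..., 250
```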
Repetition counting algorithm (algorithm-1)
Autocorrelation-based repetition counting algorithm
The recurring patterns are detected by analyzing the autocorrelation time series produced from the signed magnitude vector Z. The key concept is the temporal periodicity inherent in repetitive actions, which appears as a regular pattern of peaks in the autocorrelation signal R (see Fig. 6). In this method, the signed magnitude vector Z generated from the skeletal body joint coordinates, the applicable thresholds, and the frame counts are taken as inputs, and the motion data is analyzed iteratively as it develops over time.
Initially, the algorithm checks whether a sufficient number of frames have been processed (\(frame\_count>2\)) and ensures that the system is not currently in a waiting state (i.e., \(wait\_idx\le 0\)). If the length of the vector Z exceeds a predetermined threshold, the oldest data point is removed to keep a fixed-size window. The primary signal-processing step computes the Pearson autocorrelation of the current motion series. This autocorrelation signal \(\tilde{R}\) is normalized by its highest value to obtain R, which improves numerical stability and ensures consistent peak analysis across varied magnitudes of motion.
To minimize the impact of trailing data and noise, the last 20 entries of R are trimmed. A peak-finding algorithm is then used to identify the peaks of the resulting autocorrelation signal. If the current maximum value \(Z_q\) exceeds a predetermined threshold (indicating meaningful motion), the algorithm compares the number of detected peaks to the previously observed count. The first repetition (for \(repetition\_count == 0\)) is detected, and \(repetition\_count\) is incremented, when the deep learning model identifies the user-specified activity. If the number of peaks has increased but is still less than 3, a repetition is assumed and the repetition count is incremented. If there are three or more peaks, the algorithm enters a waiting state and records the index of the first peak. This prevents over-counting when repetitions are quick or overlapping.
Finally, the previous peak count is updated while the frame count is incremented. The loop runs continuously, allowing for continuous monitoring.
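The peak-count update logic described above can be sketched as follows (a simplified illustration of Algorithm 1; the cool-down length and the simulated peak sequence are assumptions, and the activity-recognition gating of the first repetition is omitted):

```python
def update_count(peaks, prev_peaks, state):
    """One step of the peak-based counting logic (simplified sketch;
    the cool-down length of 10 steps is an assumed value)."""
    if state["wait_idx"] > 0:
        state["wait_idx"] -= 1          # still in the waiting state
    elif prev_peaks < peaks < 3:
        state["count"] += 1             # peak count grew: one new repetition
    elif peaks >= 3:
        state["wait_idx"] = 10          # enter waiting state to avoid over-counting
    return state

state = {"count": 0, "wait_idx": 0}
prev = 0
for peaks in [0, 1, 1, 2, 2, 3]:        # simulated peak counts per step
    state = update_count(peaks, prev, state)
    prev = peaks
# two repetitions counted; the counter is now in the waiting state
```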
Experiments
Dataset
We evaluated our repetition counting method on the public benchmark datasets UI-PRMD and KIMORE, along with our own custom dataset. The UI-PRMD dataset includes joint coordinate data from two motion-capture systems, Vicon and Kinect.
UI-PRMD
This publicly available dataset includes data from 10 healthy participants who performed 10 different rehabilitation exercises in both a correct (10 times each) and an incorrect manner (10 times each). Studies such as Hsu et al.43, which evaluated their model using a pairwise cosine similarity matrix method, and Abedi et al.44, which utilized sequential neural networks for repetition counting on the UI-PRMD dataset, used the skeleton information recorded with the Kinect V2 system. In contrast, we included the data recorded with the Vicon system, focusing only on the positional data, and used only the healthy individuals’ data. The Vicon system is considered the gold standard for joint position tracking in both clinical and sports settings45. The 10 exercises are: (m01) deep squats, (m02) hurdle step, (m03) inline lunge, (m04) side lunge, (m05) sit-to-stand, (m06) standing active straight leg raise, (m07) standing shoulder abduction, (m08) standing shoulder extension, (m09) standing shoulder internal-external rotation, and (m10) standing shoulder scaption.
KIMORE
This dataset comprises RGB-D Kinect-recorded video clips of participants performing various rehabilitation exercises. The participants are divided into two groups: a patient group, including individuals with pain and postural problems (such as Parkinson’s disease, back pain, and stroke), and a control group, consisting of both experts and non-experts. In total, the dataset includes 78 individuals, 34 patients with chronic motor deficits and 44 healthy participants (12 experts and 32 non-experts). The recorded exercises include: (E1) raising the arms, (E2) tilting the trunk laterally with extended arms, (E3) rotating the trunk, (E4) rotating the pelvis in the transverse plane, and (E5) squatting. The annotations for repetition counting were derived from the study by Abedi et al.44, where repetitions were manually counted for each subject across different exercises.
Evaluation metrics
For our study, we used two evaluation metrics: (1) Mean Absolute Error (MAE) and (2) Root Mean Squared Error (RMSE).
-
For regression tasks, the Mean Absolute Error (MAE) is a commonly used evaluation metric. It calculates the average magnitude of the deviations, in either direction, between the actual and predicted values. By using the absolute difference, MAE offers a clear understanding of the average deviation between predictions and true values.
$$\begin{aligned} \text {MAE} = \frac{1}{n} \sum _{i=1}^{n} |y_i - \hat{y}_i| \end{aligned}$$(19)
-
Root Mean Squared Error (RMSE) is another frequently used evaluation metric. It calculates the average magnitude of the prediction errors by taking the square root of the mean of the squared differences between the predicted and actual values. Since squaring the errors amplifies their impact, RMSE is especially useful when larger errors need to be penalized more severely.
$$\begin{aligned} \text {RMSE} = \sqrt{\frac{1}{n} \sum _{i=1}^{n} (y_i - \hat{y}_i)^2} \end{aligned}$$(20)
where \(y_i\) is the actual or true value, \(\hat{y}_i\) is the predicted value, and n is the number of observations.
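Both metrics can be computed directly (the repetition counts below are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error between true and predicted values."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted values."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

true = [10, 10, 10]   # ground-truth repetition counts (illustrative)
pred = [10, 12, 8]    # predicted counts
# mae -> (0 + 2 + 2) / 3 ≈ 1.33; rmse -> sqrt(8 / 3) ≈ 1.63
```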
Implementation details
We conducted all the experiments, including training the transformer model, on an HP Z6 workstation with a Quadro RTX5000 graphics card, an Intel Xeon CPU, and 128 GB of random access memory. We used the Adam optimizer, categorical cross-entropy loss, and the accuracy metric while training the transformer model. The batch size was set to 64 and training was done for 20 epochs. We implemented our model in Python with PyTorch (v2.3.1), TensorFlow (v2.10.0), NumPy (v1.23) and SciPy (v1.9).
Results
Fig. 7 (a) Training accuracy with window size N. (b) Test accuracy with window size N. (c) Inference time with respect to window size N.
This section provides a comprehensive summary of the results obtained from the conducted experiments.
Fig. 8 Accuracy and loss plots with epochs.
Fig. 9 Confusion matrix.
Activity recognition
For activity recognition, we utilized a transformer encoder-based deep learning model. Figure 7a and b show the accuracy plots for the training and test data with respect to different window sizes. The maximum accuracy was recorded for window size \(N = 16\). Figure 7c shows the inference time with respect to the window size; the graph shows little variation in inference time for window sizes between 10 and 18. Based on this, we kept \(N=16\) as the optimal window size input for the deep learning model. A summary of the model architecture is shown in Table 3. Figure 8 shows the accuracy and loss plots with respect to epochs. The confusion matrix shown in Fig. 9 highlights the strong classification performance of the transformer encoder-based deep learning model. We evaluated the model using different performance metrics such as precision, recall and F1-score (see Table 4). From the table, we can see that the transformer model achieves F1-score values ranging from 0.9736 to 0.996.
Despite being a lightweight model with only 309,108 trainable parameters, this transformer encoder-based deep-learning model has shown state-of-the-art performance. With an overall size of 1.18 MB, this model is best suited for activity recognition tasks for a low-cost and lightweight home rehabilitation system.
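Such a model consumes short overlapping windows of per-frame joint coordinates rather than whole videos. A minimal sketch of that windowing step is shown below (an illustrative helper, not the paper's implementation; the defaults \(N=16\) and stride 1 match the optimal parameters reported later):

```python
def sliding_windows(frames, n=16, stride=1):
    """Group per-frame joint-coordinate vectors into overlapping windows.

    Each window of n frames is one input sample for the activity
    classifier; stride 1 yields a prediction for nearly every frame.
    """
    return [frames[i:i + n] for i in range(0, len(frames) - n + 1, stride)]

# 20 dummy frames, each a flat list of 4 joint coordinates
frames = [[float(t)] * 4 for t in range(20)]
windows = sliding_windows(frames)
print(len(windows), len(windows[0]))   # 5 16
```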
Table 4 shows the recall, precision, and F1 scores of all activities.
Repetition counting
Comparison on our custom dataset
We compared our approach for activity repetition counting with three other state-of-the-art models, proposed by Dwibedi et al.36, Levy et al.37 and Sinha et al.39, using our custom dataset. Table 5 shows the performance of our approach on our custom RGB dataset, with detailed repetition counting results for each participant. Two evaluation metrics, MAE and RMSE, are used to assess repetition counting for each activity. Table 6 shows the state-of-the-art comparison results: our approach gave the best results for 8 exercises and the second-best for 2 exercises, showcasing its robustness and effectiveness over a variety of activities and hence its suitability for an RGB camera-based home rehabilitation system. Overall, our approach outperformed the other models with an overall MAE of 1.3182 and RMSE of 1.872, while the model proposed by Dwibedi et al.36 obtained the second-best results.
Comparison on the benchmark dataset UI-PRMD
Our approach was tested on the publicly available benchmark dataset UI-PRMD. A detailed description of its performance on UI-PRMD is given in Table 7, and Table 8 compares our approach with other state-of-the-art models. Out of 10 exercises, our approach achieved the best MAE in 5 exercises and the second-best in 4. Overall, our model achieved the best repetition counting results, with the lowest MAE of 3.2733 and an RMSE of 4.1004, outperforming the other models; the model proposed by Sinha et al.39 obtained the second-best MAE of 4.79. In addition, our approach offers live counting while processing the video frames, an advantage over the other state-of-the-art models.
Comparison on the benchmark dataset KIMORE
We also evaluated our model on the KIMORE dataset. Table 9 shows the performance comparison of our model with other state-of-the-art models. Out of 5 exercises, our model achieved second-best results in 4 exercises and the second-best overall MAE. The model proposed by Dwibedi et al.36 obtained the best overall results; however, that model is not a live counter and requires all the frames to be processed before giving the final repetition count.
Figure 10. Motion and autocorrelation plots with the directional factor (left) and without the directional factor (right) for 3 repetitions of the inline lunge activity.
Ablation study
We conducted a comprehensive ablation study to investigate the effect of various enhancements: first-iteration detection using the deep learning (DL) model, variance calculation on four axes (\(x\), \(y\), and the \(+45^\circ\) and \(-45^\circ\) diagonals), variance calculation on two axes (\(x\) and \(y\)), and the directional factor (+ve or -ve) obtained as \(\text {sign}(\vec {V}_m \cdot \hat{a}_K)\). The results are shown in Table 10. Our experiments showed that incorporating the directional factor improves performance approximately tenfold: by distinguishing forward from backward motion, it prevents a single motion from being counted twice, a critical consideration in repetitive tasks. Figure 10 demonstrates the results with and without the directional factor. From the graphs it can clearly be seen that, without it, the backward motion is misinterpreted as forward motion, producing two identical positive peaks, and hence a double count, for a single repetition. Since repetition detection requires at least two iterations to be performed, the deep learning model plays a vital role in counting the first iteration of any activity. Finally, calculating the variance across four axes enhances performance, as repeating motion can occur along any arbitrary direction in the spatial domain.
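The two ablated components can be sketched in a few lines of plain Python (a simplified illustration of the idea, not the actual implementation): the trajectory of a joint is projected onto each of the four monitored axes to find where the motion variance concentrates, and the sign of the dot product between the motion vector and the dominant axis separates the forward and backward phases:

```python
import math
from statistics import pvariance

# Unit vectors for the four monitored axes: x, y, +45 deg and -45 deg
AXES = {
    "x":     (1.0, 0.0),
    "y":     (0.0, 1.0),
    "45pos": (math.cos(math.radians(45)), math.sin(math.radians(45))),
    "45neg": (math.cos(math.radians(-45)), math.sin(math.radians(-45))),
}

def four_axis_variance(points):
    """Variance of a 2-D joint trajectory projected onto each axis."""
    return {name: pvariance([px * ax + py * ay for px, py in points])
            for name, (ax, ay) in AXES.items()}

def directional_factor(v_motion, axis):
    """sign(V_m . a_K): +1 on the forward phase, -1 on the return phase,
    so that a single repetition is not counted twice."""
    dot = v_motion[0] * axis[0] + v_motion[1] * axis[1]
    return 1 if dot >= 0 else -1

# Purely horizontal trajectory: motion variance concentrates on the x axis
traj = [(0.1 * t, 0.0) for t in range(10)]
variances = four_axis_variance(traj)
print(max(variances, key=variances.get))            # x
print(directional_factor((1.0, 0.0), AXES["x"]))    # 1 (forward)
print(directional_factor((-1.0, 0.0), AXES["x"]))   # -1 (backward)
```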
Optimal parameters
For deep learning model
For activity recognition, we optimized the deep learning model architecture for enhanced accuracy. For model training, we used window size \(N=16\), stride \(u=1\), \(H=4\) heads in the multi-head self-attention, a depth of 1 and a feed-forward network dimension of 64. The model was trained with a batch size of 64 for 20 epochs, with a dropout rate of 0.1 for the dense layers.
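At the heart of each of the \(H=4\) heads is scaled dot-product attention. The pure-Python sketch below shows that weighting step for a single query and head with made-up two-dimensional vectors; the real model operates on learned projections of the \(N=16\) frame embeddings:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector (one head)."""
    d = len(query)
    scores = [sum(qi * ki for qi, ki in zip(query, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Query aligned with the first key -> output pulled toward the first value
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [0.0]]
out = attention([1.0, 0.0], keys, values)
print(out)   # most of the weight goes to the first value
```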
For repetition counter
We fine-tuned our model for improved performance across the different datasets. For data filtering on our dataset, the smoothing factor \(\alpha\) was set to 0.85 for fast activities (run and jump) and 0.08 for the remaining activities. For variance calculation, the optimal value was found to be \(M=100\). For autocorrelation, the total number of frames Q was set to 800. Finally, for peak detection, the dynamic prominence ratio was set to 0.8, the height threshold to 0.8, the minimum distance to 10 and the width to 2. For UI-PRMD, however, \(\alpha\) was set to 0.1, the dynamic threshold ratio to 0.4 and the height threshold for peak detection to 0.4. Similarly, for the KIMORE dataset, \(\alpha\) was set to 0.01, the dynamic threshold ratio to 0.4, and the height threshold for peak detection to 0.4. Table 11 shows how the results vary for different values of \(\alpha\).
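The counting stages fit together roughly as follows. The sketch below runs on a synthetic sinusoidal motion signal with hand-rolled stand-ins for the real components (the actual system uses SciPy's peak detection with the prominence, height, distance and width settings above); it is illustrative only:

```python
import math

def ema(signal, alpha):
    """Exponential smoothing; the paper uses larger alpha for fast activities."""
    out = [signal[0]]
    for x in signal[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

def autocorrelation(signal, max_lag):
    """Autocorrelation of the zero-mean signal for lags 0..max_lag-1."""
    mean = sum(signal) / len(signal)
    s = [x - mean for x in signal]
    return [sum(s[i] * s[i + lag] for i in range(len(s) - lag))
            for lag in range(max_lag)]

def simple_peaks(signal, height, distance):
    """Naive stand-in for scipy.signal.find_peaks (height + distance only)."""
    peaks = []
    for i in range(1, len(signal) - 1):
        if (signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
                and signal[i] >= height
                and (not peaks or i - peaks[-1] >= distance)):
            peaks.append(i)
    return peaks

# Synthetic motion signal: 3 repetitions with a period of 50 frames
motion = [math.sin(2 * math.pi * t / 50) for t in range(150)]
smooth = ema(motion, alpha=0.85)
ac = autocorrelation(smooth, max_lag=75)
period = simple_peaks(ac, height=0.0, distance=10)[0]   # first peak ~ period
reps = len(simple_peaks(smooth, height=0.5, distance=period // 2))
print(period, reps)   # period close to 50; reps = 3
```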
Computational analysis
We ran our system and measured the execution time of the different operations: (1) joint extraction using MediaPipe, (2) activity detection using the DL model, (3) autocorrelation, (4) peak detection, and (5) the remaining tasks, for the sit-to-stand activity. Figure 11 illustrates the mean execution time for a single frame, distributed across these operations. From the results, the system can process approximately 9 frames per second, which is adequate for offline video processing. The system could be made faster by skipping frames, although this might compromise accuracy.
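Per-stage timings like these can be collected with a small harness around each operation; in the sketch below the stage functions are hypothetical stand-ins (the real stages would call MediaPipe and the trained model):

```python
import time

def timed(stage, *args):
    """Run one pipeline stage and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    out = stage(*args)
    return out, time.perf_counter() - t0

# Hypothetical stand-ins for two per-frame stages of the pipeline
def extract_joints(frame):
    return [0.0] * 66           # MediaPipe pose landmarks would be computed here

def classify_window(window):
    return "sit-to-stand"       # transformer inference would run here

_, t_joints = timed(extract_joints, None)
_, t_class = timed(classify_window, [None] * 16)
per_frame = t_joints + t_class
print(f"upper bound: {1.0 / max(per_frame, 1e-9):.0f} fps for these two stages")
```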
We also ran the other state-of-the-art models on the same system (refer to section 4) for a fair comparison. For Sinha et al.39, the average processing time for a single frame was 0.153 seconds. For Dwibedi et al.36, it was 0.066 seconds, which is fast, but that model is not a live counter and processes the whole video before giving the output. Levy et al.37 had the fastest average frame execution time at 0.019 seconds, but the lowest performance.
Additionally, after training, we deployed our system on a low-computation laptop (Toshiba Satellite) equipped with 12 GB RAM, Intel HD Graphics and an Intel Core i5 (4th generation) processor for computational time analysis. We again measured the execution time of the different operations for the sit-to-stand activity: joint extraction using MediaPipe took 0.059 sec, activity detection using the DL model took 0.031 sec, autocorrelation took 0.006 sec, peak detection took 0.009 sec, and the remaining tasks took 0.017 sec. The average processing time for a single frame was 0.122 sec, which is approximately 8 frames per second.
Figure 11. Mean execution time for a single frame, distributed across the various tasks.
Real time evaluation
We tested our system in real time with one of the authors as the participant. From the experiments we concluded that our system works well for slow activities in real time. We therefore tested the exercises inline lunge, squats, standing shoulder scaption, standing shoulder extension, standing shoulder internal-external rotation, standing shoulder abduction, and sit-to-stand, with 10 repetitions of each. The system counted exactly 10 repetitions for every exercise except sit-to-stand, for which it counted 11, giving an overall MAE score of 1.429 and an RMSE score of 1.429. Note that for this real-time analysis we ran our model on a low-computation laptop (Toshiba Satellite) equipped with 12 GB RAM, Intel HD Graphics and an Intel Core i5 (4th generation) processor, using the laptop's webcam for the live feed. This demonstrates the feasibility of real-time home use on low-end devices as well.
Discussion
This study proposes a live repetition counting algorithm that does not require prior training. The experimental results demonstrate the superiority, efficacy and robustness of the proposed system over a wide range of activities, and our preliminary study demonstrates its feasibility in real-time scenarios as well. The comparative analysis further highlights the strengths of our system.
The system incorporates activity recognition to identify the desired activity preset by the user. The integration of the directional factor mitigates double counting due to the misinterpretation of the reverse cycle as forward. The variance calculation enhances the performance by allowing the system to detect the repetition along any arbitrary axis in the space.
Our approach gave the best results on the publicly available UI-PRMD dataset and the second-best results on the KIMORE dataset. Unlike our system, which requires joint coordinates (extracted here using MediaPipe), the other state-of-the-art models operate on videos only. The coordinates in the UI-PRMD and KIMORE datasets for each exercise were therefore first converted to images and then compiled into videos (see Fig. 12), which were used as inputs to the other state-of-the-art models, while the same joint coordinates were provided directly to our system during the evaluation for a fair comparison.
Figure 12. Visualization of the benchmark datasets: (a) exercises of the KIMORE dataset, (b) exercises of the UI-PRMD dataset.
Limitations
Despite superior performance across different metrics, our approach has potential limitations. First, the system has been tested in a controlled lab environment with good lighting conditions and camera quality. It relies heavily on the MediaPipe framework for joint coordinate extraction, which has been trained and tested under varied lighting and noise conditions; still, the system as a whole needs testing under diverse conditions. Second, although the proposed system is a live counter, testing has been done on pre-recorded videos and only a preliminary real-time study has been conducted; an extensive real-time study with more participants is still required. Third, testing with real patients is yet to be done. Furthermore, while the proposed system is training-independent for repetition counting, training is required for activity recognition, and the smoothing factor \(\alpha\) changes accordingly: higher values for fast activities (jump and run) and lower values for slow activities. This value is handled dynamically by the deep learning model. However, a model could also be trained to estimate the period length, thereby adjusting the hyper-parameters for dynamic performance.
Future scope
Our future work is oriented towards collecting diverse data under different lighting and environmental conditions, including both healthy individuals and patients. We aim to develop a full end-to-end system deployable on devices such as laptops, smartphones or tablets. Further improvements may involve dynamically adjusting the hyper-parameters without relying on deep learning models, hence making the system fully training-independent.
Conclusion
In the current study, we built a pipeline for classifying different human activities and live counting of repetitions on both pre-recorded videos and a real-time webcam feed. The proposed system combines a lightweight transformer-based model for activity recognition with an autocorrelation-based, training-independent general repetition counter, making it suitable for home rehabilitation. The system works on RGB data and therefore relies on minimal hardware (a camera and a PC or laptop). It has shown excellent performance on the different datasets, proving its generalizability and robustness compared to other state-of-the-art models.
Data availability
Access to data may be requested from the corresponding author.
Code availability
References
Warburton, D. E., Nicol, C. W. & Bredin, S. S. Health benefits of physical activity: The evidence. CMAJ 174(6), 801–809 (2006).
Ka, S. Barriers and motivations to exercise in older adults. Prevent. Med. 39(5), 1056–1061 (2004).
Müller, M., Toussaint, R. & Kohlmann, T. Total hip and knee arthroplasty: Results of outpatient orthopedic rehabilitation. Orthopade 44, 203–211 (2015).
Deutsche Rentenversicherung Bund. Reha-Therapiestandards Hüft- und Knie-TEP. Retrieved from https://www.deutsche-rentenversicherung.de/SharedDocs/Downloads/DE/Experten/infos_reha_einrichtungen/quali_rehatherapiestandards/TEP/rts_tep_download.pdf (2020).
Ritter, S., Dannenmaier, J., Jankowiak, S., Kaluscha, R. & Krischak, G. Implantation einer Hüft-oder Knietotalendoprothese und die Inanspruchnahme einer Anschlussrehabilitation. Die Rehabil. 57(04), 248–255 (2018).
Baulig, C., Grams, M., Rohrig, B., Linck-Eleftheriadis, S. & Krummenauer, F. Clinical outcome and cost effectiveness of inpatient rehabilitation after total hip and knee arthroplasty. A multi-centre cohort benchmarking study between nine rehabilitation departments in Rhineland-Palatinate (Western Germany). Eur. J. Phys. Rehabil. Med. 51(6), 803–813 (2015).
Müller, E., Mittag, O., Gülich, M., Uhlmann, A. & Jäckel, W. H. Systematic literature analysis on therapies applied in rehabilitation of hip and knee arthroplasty: Methods, results and challenges. Die Rehabil. 48(2), 62–72 (2009).
Tuncel, T., Simon, S. & Peters, K. M. Flexible rehabilitation times after total hip and knee replacement. Orthopade 44, 465–473 (2015).
Khan, F., Ng, L., Gonzalez, S., Hale, T. & Turner-Stokes, L. Multidisciplinary rehabilitation programmes following joint replacement at the hip and knee in chronic arthropathy. Cochrane Datab. Syst. Rev. (2) (2008).
Ottenbacher, K. J. & Graham, J. E. The state-of-the-science: Access to postacute care rehabilitation services. A review. Arch. Phys. Med. Rehabil. 88(11), 1513–1521 (2007).
Sibold, M. et al. Predictors of participation in medical rehabilitation follow-up in working patients with chronic back pain. Die Rehabil. 50(6), 363–371 (2011).
Moffet, H. et al. In-home telerehabilitation compared with face-to-face rehabilitation after total knee arthroplasty: A noninferiority randomized controlled trial. JBJS 97(14), 1129–1141 (2015).
Tousignant, M. et al. A randomized controlled trial of home telerehabilitation for post-knee arthroplasty. J. Telemed. Telecare 17(4), 195–198 (2011).
Zerpa, C., Lees, C., Patel, P., Pryzsucha, E. & Patel, P. The use of microsoft kinect for human movement analysis. Int. J. Sports Sci. 5(4), 120–127 (2015).
Clark, R. A. et al. Validity of the Microsoft Kinect for assessment of postural control. Gait Posture 36(3), 372–377 (2012).
Huang, J. D. Kinerehab: A kinect-based system for physical rehabilitation: A pilot study for young adults with motor disabilities. In The proceedings of the 13th international ACM SIGACCESS conference on Computers and accessibility 319–320 (2011).
Chen, C., Liu, K., Jafari, R. & Kehtarnavaz, N. Home-based senior fitness test measurement system using collaborative inertial and depth sensors. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society 4135–4138 (IEEE, 2014).
Dibben, G. et al. Exercise-based cardiac rehabilitation for coronary heart disease. Cochrane Datab. Syst. Rev. (11) (2021).
Ferreira, R., Santos, R. & Sousa, A. Usage of auxiliary systems and artificial intelligence in home-based rehabilitation: A review. Exploring the Convergence of Computer and Medical Science Through Cloud Healthcare, 163-196 (2023).
Fernandez-Cervantes, V., Neubauer, N., Hunter, B., Stroulia, E. & Liu, L. VirtualGym: A kinect-based system for seniors exercising at home. Entertain. Comput. 27, 60–72 (2018).
Sangani, S., Patterson, K. K., Fung, J. & Lamontagne, A. Real-time avatar-based feedback to enhance the symmetry of spatiotemporal parameters after stroke: Instantaneous effects of different avatar views. IEEE Trans. Neural Syst. Rehabil. Eng. 28(4), 878–887 (2020).
Liao, Y., Vakanski, A., Xian, M., Paul, D. & Baker, R. A review of computational approaches for evaluation of rehabilitation exercises. Comput. Biol. Med. 119, 103687 (2020).
Chander, A., Bathula, D. R. & Sahani, A. K. Towards an RGB camera-based rehabilitation exercise assessment using an enhanced spatio-temporal transformer framework. Multimed. Tools Appl. 1-25 (2025).
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C. L., Yong, M., Lee, J. & Chang, W. T. Mediapipe: A framework for perceiving and processing reality. In Third workshop on computer vision for AR/VR at IEEE computer vision and pattern recognition (CVPR) (Vol. 2019) (2019).
Vakanski, A., Jun, H. P., Paul, D. & Baker, R. A data set of human body movements for physical rehabilitation exercises. Data 3(1), 2 (2018).
Capecci, M. et al. The kimore dataset: Kinematic assessment of movement and clinical scores for remote monitoring of physical rehabilitation. IEEE Trans. Neural Syst. Rehabil. Eng. 27(7), 1436–1448 (2019).
Mukhopadhyay, S. C. Wearable sensors for human activity monitoring: A review. IEEE Sens. J. 15(3), 1321–1330 (2014).
Lara, O. D. & Labrador, M. A. A survey on human activity recognition using wearable sensors. IEEE Commun. Surv. Tutor. 15(3), 1192–1209 (2012).
Soro, A., Brunner, G., Tanner, S. & Wattenhofer, R. Recognition and repetition counting for complex physical exercises with deep learning. Sensors 19(3), 714 (2019).
Cutler, R. & Davis, L. S. Robust real-time periodic motion detection, analysis, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 781–796 (2000).
Alatiah, T. & Chen, C. Recognizing exercises and counting repetitions in real time. Preprint at arXiv:2005.03194 (2020).
Stoica, P. & Moses, R. L. Spectral analysis of signals Vol. 452, 25–26 (Prentice Hall, 2005).
Vlachos, M., Yu, P. & Castelli, V. On periodicity detection and structural periodic similarity. In Proceedings of the 2005 SIAM International Conference on Data Mining 449–460 (Society for Industrial and Applied Mathematics, 2005).
Runia, T. F., Snoek, C. G. & Smeulders, A. W. Real-world repetition estimation by div, grad and curl. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 9009–9017 (2018).
Zhang, H., Xu, X., Han, G. & He, S. Context-aware and scale-insensitive temporal repetition counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 670–678 (2020).
Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P. & Zisserman, A. Counting out time: Class agnostic video repetition counting in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 10387–10396 (2020).
Levy, O. & Wolf, L. Live repetition counting. In Proceedings of the IEEE International Conference on Computer Vision 3020–3028 (2015).
Li, X., Singh, V., Wu, Y., Kirchberg, K., Duncan, J. & Kapoor, A. Repetitive motion estimation network: Recover cardiac and respiratory signal from thoracic imaging. Preprint at arXiv:1811.03343 (2018).
Sinha, S., Stergiou, A. & Damen, D. Every shot counts: Using exemplars for repetition counting in videos. In Proceedings of the Asian Conference on Computer Vision 3056–3073 (2024).
Liao, Y., Vakanski, A. & Xian, M. A deep learning framework for assessing physical rehabilitation exercises. IEEE Trans. Neural Syst. Rehabil. Eng. 28(2), 468–477 (2020).
Garg, S., Saxena, A. & Gupta, R. Yoga pose classification: A CNN and MediaPipe inspired deep learning approach for real-world application. J. Ambient. Intell. Humaniz. Comput. 14(12), 16551–16562 (2023).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst.30 (2017).
Hsu, Y. C., Efstratios, T. & Tsui, K. L. Viewpoint-invariant exercise repetition counting. Health Inf Sci Syst. 12(1), 1. https://doi.org/10.1007/s13755-023-00258-3 (2023).
Abedi, A. et al. Rehabilitation Exercise Repetition Segmentation and Counting Using Skeletal Body Joints. In 2023 20th Conference on Robots and Vision (CRV), Montreal, QC, Canada 288–295. https://doi.org/10.1109/CRV60082.2023.00044 (2023).
Lee, N., Ahn, J. & Lim, W. Concurrent and angle-trajectory validity and intra-trial reliability of a novel multi-view image-based motion analysis system. J. Hum. Kinet. 86, 31–40. https://doi.org/10.5114/jhk/159587 (2023).
Author information
Authors and Affiliations
Contributions
A.P.C. wrote the main manuscript, identified the problem statement, prepared the figures, tables and graphs, wrote the code, proposed the novel approaches and contributed to the data collection. C.G. contributed to data collection. A.K.S. contributed to designing the problem statement.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
Ethics approval was received from the local Institutional Ethics Committee (Humans) at Indian Institute of Technology Ropar file no. IITRPR/IEC/2023/025 dated 1st March, 2024.
Consent to participate
Informed consent was obtained from all participants.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chander, A., Singhal, C. & Sahani, A.K. Towards an RGB camera-based live repetition counter using auto correlation with action recognition for home rehabilitation. Sci Rep 15, 41799 (2025). https://doi.org/10.1038/s41598-025-25674-1
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-25674-1