Introduction

COVID-19 has caused the deaths of millions of people globally. Heart disease and stroke are both serious public health issues; the latter impairs mobility and is a leading contributor to disability. Mental health issues are becoming more prevalent; millions of individuals globally suffer from depression. In today’s environment, obesity, sedentary lifestyles, and poor nutrition are the primary causes of many health problems1.

Furthermore, there is no doubt about the impact that computers and computer-powered technologies have on healthcare and related fields, as well as every other field. Yoga, Zumba, martial arts, and other pastimes are commonly recognised as means to improve one’s health, as are routine medical procedures. With origins in ancient India, yoga is a broad category of activities aimed at improving a person’s physical, mental, and spiritual well-being2.

Integrating technology into yoga practice can benefit from artificial intelligence tools such as Pose-Net and Mobile-Net SSD, as well as from human posture detection. In the field of Human Computer Interaction (HCI), identifying the human body presents a major challenge3,4. It is used for a variety of purposes, including daily tasks, yoga, sports, and more. A key topic in computer vision is human posture estimation, which has applications in behaviour analysis, intelligent driver assistance systems, assisted living5 and visual surveillance. Since the emergence of deep neural networks, pose estimation performance has significantly improved. Computer vision technology is employed to standardize and correct yoga postures. It is important to perform yoga poses correctly to avoid injuries and long-term complications6. Studying human posture can help detect and address abnormal positions, enhancing overall well-being at home7.

Despite the relatively small number of qualified professional yoga instructors, yoga is a popular physical exercise with a large global following. Self-study, such as mechanically mimicking yoga motions from instructional films, is the only way for most yoga beginners to learn the practice. With this method, learners find it difficult to observe the fine details of their full-body posture, because many poses require the student to direct their gaze in a particular direction. This restriction makes the learning process less effective. Consequently, identifying and assessing yoga poses is essential for offering direction for independent study.

The main goal of this research work is to develop and implement a Shankha Prakshalana yoga pose detection system that can efficiently identify and track yoga postures. We aim to offer a novel framework that incorporates posture features to accomplish accurate and robust yoga pose detection, by thoroughly analysing existing literature and approaches and by using a variety of machine learning (ML) algorithms. The significance of this work lies in its ability to link technology with yoga practice.

Related works

In the field of pose detection8,9 and computer vision, recent research efforts have contributed significantly to the advancement of methodologies and models10,11. Table 1 provides a summary of reviewed works relevant to the current study. Poselet-conditioned pictorial structures have demonstrated enhanced precision in human pose estimation12, while the integration of YOLO V4 has proven effective for object detection, particularly in discerning yoga postures with complex spatial relationships13. Multi-person yoga pose estimation has seen notable progress14,15,16,17 with the utilization of part affinity fields18. The work of19 introduces a novel approach to pose estimation from image sensor data, showcasing advancements in real-time human pose recognition. Furthermore, robust 3D pose estimation techniques have been presented20, highlighting the potential for accurate spatial understanding. Pose recognition from depth images has been addressed by21, showcasing the feasibility of real-time recognition in dynamic yoga environments. The exploration of articulated pose estimation models and convolutional pose networks22,23 has contributed to the evolution of 3D human pose estimation methods. Temporal convolutional networks24 have been pivotal in advancing real-time yoga recognition by capturing spatial-temporal features. The integration of improved YOLO-V3 models has demonstrated success in object detection applications within agriculture and surveillance contexts25.

Table 1 Reviewed Works in Yoga Pose Recognition

Research gap & motivation

Therapeutic technologies improve mobility by reducing impairments at the body structure/function level, aiding the body in repairing or addressing structural impairments, and supporting rehabilitation of impaired body function. As opposed to therapeutic technologies, assistive technologies are intended for use in the home and community to facilitate the execution of functional tasks and are operated by the user rather than a clinician26. Accordingly, the emphasis of this study is the prediction of Shankha Prakshalana yoga poses that will assist the individual in resolving bowel movement issues. Shankha Prakshalana is a yoga asana that is renowned for its numerous health advantages. However, achieving mastery and proficiency in executing the pose with precision necessitates appropriate guidance and adherence to proper form. An automated system for recognising yoga poses can serve as a valuable tool in delivering real-time feedback and ensuring the safety of practitioners.

Novelty and scope

The objective of this study is to create a comprehensive framework for pose estimation utilising artificial intelligence methods to assist individuals in executing Shankha Prakshalana kriya with accurate posture. The main goal is to mitigate injuries and enhance the quality of human exercise by utilising a computer and camera system. The proposed system employs an innovative methodology for posture detection and correction utilising convolutional neural network models. The system’s performance is assessed using both supervised and unsupervised learning techniques to determine the best-performing model for the Shankha Prakshalana yogic kriya.

Significant contribution and outline

The noteworthy contributions of the research presented in this work are enumerated as follows.

  • A comprehensive dataset is proposed on Shankha Prakshalana kriya, specifically designed for the purpose of yoga pose recognition. The dataset comprises a total of 77 RGB videos, which have been categorised into 11 distinct groups.

  • A customised pipeline utilising supervised ML methodology is proposed, which incorporates a Random Forest classifier. This integration aims to enhance the precision and resilience of yoga pose detection.

  • A comparison is made between the proposed pipeline and other supervised learning algorithms, including k-NN (k-Nearest Neighbours) and SVM (Support Vector Machines), as well as different clustering techniques.

  • The proposed dataset is applied to both supervised and unsupervised ML algorithms to determine the best-performing model for Shankha Prakshalana kriya.

Section 2 provides a comprehensive analysis of the proposed Yoga Pose Recognition System. Section 3 presents the results obtained from the conducted experiments. The findings are subjected to analysis in Section 4. Section 5 provides a comprehensive summary of the conclusion.

Proposed yoga pose recognition system

The acquisition of the yoga pose detection dataset and the extraction of pose features are described in this section. The pipeline to generate a recognition system to detect Shankha Prakshalana kriya yoga poses is depicted in Fig. 1.

Figure 1

The pipeline is designed to produce a recognition system that can identify the yoga poses of the Shankha Prakshalana kriya.

Yoga pose dataset collection and preprocessing

The proposed dataset for yoga pose detection was meticulously collected to ensure diversity and representativeness. Seven 15-second RGB videos were recorded for each of the 11 yoga poses relevant to the “Shankha Prakshalana” Kriya. This resulted in a total of 77 videos in the dataset (77 video.mov). To maintain consistency and standardization, all videos were recorded with a smartphone. The recorded videos were then processed using FFmpeg to convert them into uncompressed MKV format (77 video.mkv). Each video has a resolution of 1080x720 pixels. Samples of the different classes are presented in Fig. 2.

Figure 2

Sample yoga poses from the proposed Shankha Prakshalana kriya dataset.

Data augmentation and feature extraction

The videos first undergo augmentation using the vidaug library, a Python package designed to augment videos for deep learning systems. The process converts the recorded videos into a new and significantly larger collection of slightly modified videos. Augmentation techniques such as rotation, variable tilting, and frame blurring are employed to increase the variability of the videos. Two operations are particularly relevant for gesture recognition. The first is Gaussian blur, a widely used technique to mitigate visual noise: it is an image-blurring filter that uses a Gaussian function to determine the change applied to each pixel in the image. The second is local contrast normalisation, a standardisation technique that implements local competition between neighbouring features in a feature map and between features in the same spatial location across multiple feature maps.
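The Gaussian weighting described above can be illustrated with a minimal, dependency-free sketch. It is not the vidaug implementation; the function names, `sigma`, `radius`, and the edge-clamping rule are assumptions chosen for illustration, and only a single row of pixel intensities is blurred.

```python
import math

def gaussian_kernel_1d(sigma: float, radius: int) -> list[float]:
    """Return normalised 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_row(row: list[float], kernel: list[float]) -> list[float]:
    """Blur one row of pixel intensities, replicating the edge pixels."""
    radius = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index at the borders
            acc += w * row[j]
        out.append(acc)
    return out

# Hypothetical input: a single bright pixel in an otherwise dark row.
kernel = gaussian_kernel_1d(sigma=1.0, radius=2)
blurred = blur_row([0.0, 0.0, 1.0, 0.0, 0.0], kernel)
```

Each output pixel is a weighted average of its neighbours, with weights that fall off with distance according to the Gaussian function, so the bright pixel is spread over adjacent positions while the total intensity is preserved.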

RGB video data is utilised as input for the purpose of extracting skeletal features. The MediaPipe Hands feature developed by27 utilises skeletal characteristics to estimate hand postures. The software is a freely available framework designed for building pipelines that handle sequential data, specifically video and audio. High-fidelity hand tracking is accomplished through the utilisation of ML algorithms to predict 21 3D key points of a hand from a single image. The keypoints are stored in the numpy format for convenient input into the model. The pseudo code for pose extraction and augmentation, as shown in Algorithm 1, provides a concise explanation of this process.

Algorithm 1

Pseudo code for Pose Extraction and Data Augmentation

Proposed methodology

In this section, we compare different ML techniques using the proposed dataset for both supervised and unsupervised learning. We evaluate the performance of these techniques using various evaluation metrics. The ML techniques are k-NN, SVM, and RF (Random Forest). Deep learning techniques, specifically neural network-based algorithms, have gained significant popularity in recent years. However, they have been observed to exhibit overfitting issues when applied to the Shankha Prakshalana dataset. While accuracy is widely recognised as the primary performance metric in activity recognition, we also incorporated supplementary metrics like precision, recall, and F1 score28.

K-nearest neighbour

The k-nearest neighbours (k-NN) algorithm classifies a given data sample x using a pre-existing training set X. It calculates the distance between x and every training sample in X. The algorithm then chooses the label that has the highest weight among the k closest training samples. A distance-based weighting technique can be applied to amplify the influence of closer neighbours on the final label prediction29.
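The voting rule described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the configuration used in the study: the inverse-distance weighting, the value of `k`, and the toy 2-D points and labels are all assumptions.

```python
import math
from collections import defaultdict

def knn_predict(train_X, train_y, x, k=3, weighted=True):
    """Predict the label of x from its k nearest training samples."""
    # Distance from the query to every training sample, nearest first.
    dists = sorted((math.dist(p, x), label) for p, label in zip(train_X, train_y))
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # Inverse-distance weighting; the epsilon avoids division by zero.
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical 2-D feature vectors with two pose labels.
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_y = ["pose_a", "pose_a", "pose_b", "pose_b", "pose_b"]
query = (0.2, 0.1)

uniform = knn_predict(train_X, train_y, query, k=5, weighted=False)
by_distance = knn_predict(train_X, train_y, query, k=5, weighted=True)
```

With k = 5 and uniform votes, the three distant `pose_b` samples outvote the two nearby `pose_a` samples; with distance weighting, the nearby samples dominate and the query is labelled `pose_a`.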

Support vector machines

This algorithm generates one or more decision boundaries in the n-dimensional input feature space, chosen so that the distance to the closest samples of each label is maximal. This requires the data to be linearly separable. If the dataset is not linearly separable, it is possible to transform the training data into a higher-dimensional space with N dimensions (N>n) and identify an optimal hyperplane within that space. Nevertheless, this projection can incur significant computational costs. The SVM algorithm employs a kernel trick to mitigate this issue. Using a kernel function instead of directly projecting the data points into a higher-dimensional space allows one to find an ideal decision boundary. In the N-dimensional space, this kernel function characterises the dot-product of the data points.
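The kernel trick can be checked numerically on a small example. The sketch below uses a degree-2 homogeneous polynomial kernel purely for illustration (the kernel used in the study is not stated); `poly_kernel` and `feature_map` are hypothetical helper names.

```python
import math

def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit map from 2-D to 3-D whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = poly_kernel(x, y)                     # evaluated in the 2-D input space
rhs = dot(feature_map(x), feature_map(y))   # evaluated in the 3-D feature space
```

Both values agree (up to floating-point rounding), which shows that the kernel returns the dot product in the higher-dimensional space without ever constructing the projected vectors.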

Random forest

The random forest (RF) algorithm is an ML algorithm in the field of ensemble learning: multiple ML models, specifically decision trees, are used to predict the labels of new input data. The final prediction of the RF is determined by the majority label of the predictions made by the individual trees (the weak classifiers). Furthermore, random feature selection or subsampling is performed during the training phase, so that the training process for each decision tree involves a subset of the input features. This approach aims to reduce the correlation between decision trees and enhance their ability to generalise. Moreover, performance can be enhanced by training each weak classifier on its own randomly selected subset of samples, drawn with replacement [68]. This methodology is commonly referred to as bootstrapping.
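The three ingredients described above (bootstrapped samples, random feature subsets, and majority voting) can be sketched without fitting any trees. The helper names, the subset size of 9 features, and the example labels are assumptions for illustration only; in practice a library implementation of random forests would normally be used.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def random_feature_subset(n_features, size, rng):
    """Pick `size` distinct feature indices for one tree."""
    return sorted(rng.sample(range(n_features), size))

def majority_vote(predictions):
    """Return the most common label among the individual tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                    # fixed seed for reproducibility
rows = bootstrap_indices(10, rng)         # samples used to train one tree
cols = random_feature_subset(99, 9, rng)  # features visible to that tree
label = majority_vote(["pose_3", "pose_5", "pose_3"])
```

Because every tree sees a different bootstrap sample and a different feature subset, the trees make partly independent errors, and the majority vote tends to cancel them out.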

Results

This section describes the training of the proposed models and evaluates their performance.

Experimental setup

The experimental setup was performed on a local machine called DESKTOP-OHD4ICG, which is equipped with an Intel(R) Core(TM) i7-9750H CPU running at a frequency of 2.60GHz (2.59 GHz) and 16.0 GB of installed RAM (15.9 GB usable). The system functions on a 64-bit operating system that utilises a processor based on the x64 architecture.

To conduct our experiments, we devised an ML model with the objective of detecting and classifying yoga poses. The system utilised a combination of supervised and unsupervised learning techniques. Furthermore, we implemented data augmentation methodologies to enhance the quality and depth of our dataset. Once the dataset was divided into training and testing subsets, the dimensions obtained were (3548, 15, 99) for the input data and (3548, 12) for the corresponding labels in the training set. In the testing set, the dimensions were (887, 15, 99) for the input data and (887, 12) for the corresponding labels. The dataset consisted of 11 unique categories of yoga poses.
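The reported array shapes can be reproduced with a small sketch of how per-frame features might be grouped into fixed-length sequences. Several points are assumptions rather than statements from the text: that the 99 features per frame correspond to 33 landmarks with 3 coordinates each, that windows do not overlap, and the hypothetical 47-frame video. The 80/20 ratio follows directly from the reported subset sizes.

```python
def make_windows(frames, seq_len):
    """Split per-frame feature vectors into non-overlapping windows."""
    n_windows = len(frames) // seq_len   # trailing frames are dropped
    return [frames[i * seq_len:(i + 1) * seq_len] for i in range(n_windows)]

NUM_LANDMARKS = 33   # assumption: 33 landmarks x 3 coordinates = 99 features
COORDS = 3
SEQ_LEN = 15         # sequence length matching the reported shape

# Hypothetical video of 47 frames, each a flat 99-value feature vector.
video = [[0.0] * (NUM_LANDMARKS * COORDS) for _ in range(47)]
windows = make_windows(video, SEQ_LEN)
shape = (len(windows), len(windows[0]), len(windows[0][0]))

# Fraction of sequences in the training set, from the reported sizes.
train_fraction = 3548 / (3548 + 887)
```

Here `shape` has the same trailing dimensions (15, 99) as the reported input arrays, and `train_fraction` shows that the reported subsets correspond to an 80/20 split of 4435 sequences.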

During the course of the experimentation, our primary objective was to attain a high level of reliability in the task of pose identification. This study employed a combination of supervised and unsupervised learning methods, as well as data augmentation, to improve the model’s capacity to generalise across different yoga poses. The experimental framework was developed to enable comprehensive assessment and verification of the model’s effectiveness in accurately identifying yoga poses.

Performance analysis for supervised learning

The comparison between different supervised learning algorithms on the proposed dataset is shown in Table 2. This study aimed to assess the effectiveness of three different classification algorithms, namely k-NN, SVM, and RF, on video data. The video data consisted of different sequence lengths, specifically 5, 10, 15, 20, 25, and 30 frames. The algorithms were chosen for their different mechanisms for handling classification tasks: k-NN is a straightforward, instance-based learner, SVM is known for its ability to handle high-dimensional spaces, and RF is recognised for its ensemble method, which provides excellent precision and versatility. We used various sequence lengths to ascertain the most suitable temporal resolution for each algorithm, thereby improving our understanding of how the level of detail in video data affects classification performance.

Table 2 Comparison of supervised learning techniques on the proposed dataset for different values of sequence length

The findings demonstrate that RF consistently exhibited superior performance compared to k-NN and SVM across all sequence lengths, with a peak accuracy of 99.66% at sequence lengths of 15 and 20 frames. At sequence length 5, the accuracies achieved by k-NN, SVM, and RF were 98.6%, 98.9%, and 99.2%, respectively. The accuracies achieved at sequence length 10 were 98.4%, 98.3%, and 98.6%. Both k-NN and SVM achieved a performance of 99.2% at a sequence length of 15, whereas RF reached its highest performance of 99.6%. At a sequence length of 20, the k-NN and SVM algorithms achieved a performance of 98.9% and 98.6%, respectively. RF achieved accuracies of 99.4% and 99.3% for sequence lengths 25 and 30, respectively, while k-NN and SVM exhibited slightly lower performances.

The performance of the algorithms is represented in Fig. 3. The superior performance of RF can be attributed to its ensemble learning methodology, which improves the accuracy and resilience of predictions, its capacity to efficiently process a substantial quantity of input features, and its ability to withstand noise and variability in video data. On the other hand, the instance-based approach of k-NN and the requirement for extensive parameter tuning in SVM are likely factors that contributed to their relatively lower performance. The results of this study indicate that RF is the most accurate of the tested algorithms for this video classification task, and that it performs strongly across different temporal resolutions.

Figure 3

Graphical representation of the various supervised learning algorithms on the proposed dataset for different sequence lengths.

The confusion matrix presented in Fig. 4 shows the performance of the multi-class classification model when applied to 11 different classes. Each row in the matrix corresponds to the true class, whereas each column corresponds to the predicted class. The diagonal components of the matrix represent the count of accurate predictions for each class, indicating that the model has attained a high level of accuracy, as most of the predictions are concentrated on the diagonal. The model successfully classified 81 instances of class 0, 82 instances of class 1, 91 instances of class 2, 88 instances of class 3, 73 instances of class 4, 80 instances of class 5, 77 instances of class 6, 76 instances of class 7, 70 instances of class 8, 75 instances of class 9, and 91 instances of class 10. The occurrence of misclassifications is negligible: one instance of class 4 is misclassified as class 3, one instance of class 6 is misclassified as class 5, and one instance of class 7 is misclassified as class 6. With 884 of the 887 test instances classified correctly, this yields a total accuracy of 99.66%, demonstrating the model’s resilience and dependability in categorising the instances. This classification performance indicates the model’s potential suitability for real-world situations where high accuracy is important. The obtained results demonstrate the model’s ability to handle multi-class classification tasks with few errors, which makes it a valuable asset for future research and practical implementation.

Figure 4

Confusion Matrix of the Random Forest Classifier.
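As a quick arithmetic check, the overall accuracy can be recomputed from the per-class counts of correct predictions and the three misclassifications reported for the confusion matrix. The variable names are illustrative; the numbers are taken from the text.

```python
# Correct predictions for classes 0..10, as reported for the confusion matrix.
correct_per_class = [81, 82, 91, 88, 73, 80, 77, 76, 70, 75, 91]
misclassified = 3  # one instance each from classes 4, 6 and 7

correct = sum(correct_per_class)
total = correct + misclassified
accuracy = correct / total
accuracy_percent = round(accuracy * 100, 2)
```

The total of 887 instances matches the size of the test set, and the resulting accuracy agrees with the peak RF accuracy reported for the supervised comparison.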

Fig. 5 shows the different evaluation metrics for the Random Forest classifier. The x-axis denotes the class labels, representing the categories in the classification task, with values ranging from 1 to 11. Three bars are plotted for each class: the first representing precision, the second representing recall, and the third representing the F1-score. The numerical values of each metric span from 0 to 1, and the height of each bar corresponds to its value. Classes whose bars are close to 1 for all three metrics demonstrate exceptional performance. Slight differences among the bars indicate variations in the model’s performance, such as a lower precision for class 5 in comparison to its recall and F1-score, which suggests occasional false positive predictions. In general, the graph offers a concise and visual representation of the model’s performance across the classes, facilitating the identification of both its strengths and areas requiring enhancement.

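For a given class, with TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives, the metrics are defined as:

$$\begin{aligned} & \text {Precision} = \frac{TP}{TP + FP} \end{aligned}$$
(1)
$$\begin{aligned} & \text {Recall} = \frac{TP}{TP + FN} \end{aligned}$$
(2)
$$\begin{aligned} & \text {F1-score} = 2 \times \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(3)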
Figure 5

Bar graph depicting the recall, precision and F1-score values for the Random Forest classifier model.
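The per-class metrics can be derived directly from a confusion matrix whose rows are true classes and whose columns are predicted classes. The sketch below uses a small hypothetical 3-class matrix, not the matrix from this study; note that recall divides by the false negatives (instances of the class that were predicted as something else), whereas precision divides by the false positives.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) per class; rows true, columns predicted."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, actually another class
        fn = sum(cm[c]) - tp                       # actually c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

# Hypothetical 3-class confusion matrix for illustration only.
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
metrics = per_class_metrics(cm)
```

For class 0 in this toy matrix, precision is 8/9 (one instance of class 1 was wrongly predicted as class 0) while recall is 8/10 (two instances of class 0 were missed), which is the kind of gap between bars that the figure makes visible.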

Performance analysis for unsupervised learning

In this section, we chose to utilise three different clustering algorithms in order to analyse video data as depicted in Table  3. These algorithms include Agglomerative Clustering, Gaussian Mixture Model (GMM), and K-Means clustering. The justification for this decision is based on the variety of clustering approaches employed by these algorithms, each providing distinct viewpoints and methodologies for grouping data points that are similar. Agglomerative Clustering is a method that iteratively combines the closest data points to create clusters. On the other hand, GMM models the data as a combination of Gaussian distributions. Lastly, K-Means partitions the data into k clusters based on the proximity of centroid points. Through the utilisation of these three algorithms, our objective was to thoroughly investigate the clustering structure of the video data and evaluate their efficacy in revealing significant patterns.

Table 3 Analysis of unsupervised learning methods for different clustering algorithms on the given dataset over different sequence lengths
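To make the centroid-based partitioning of K-Means concrete, the following is a minimal sketch of the standard assign-and-update iteration on toy 2-D points. The initial centroids, the fixed number of iterations, and the data are assumptions for illustration; it does not handle empty clusters and is not the implementation used in the study.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def update(points, labels, k):
    """Mean of the points assigned to each cluster."""
    new_centroids = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        # Assumes every cluster keeps at least one member in this toy example.
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids

def kmeans(points, centroids, iters=10):
    """Alternate assignment and centroid update for a fixed number of steps."""
    for _ in range(iters):
        labels = assign(points, centroids)
        centroids = update(points, labels, len(centroids))
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 0.0)])
```

The two left-hand points end up in one cluster and the two right-hand points in the other, with each centroid at the mean of its members.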

Additionally, we manipulated the duration of the video sequences, specifically 5, 10, 15, 20, 25, and 30 frames, throughout our analysis. This intentional variation was implemented to examine the impact of video data temporal granularity on the performance of each clustering algorithm. Diverse sequence lengths are employed to capture different degrees of temporal information. Through the analysis of clustering outcomes across these lengths, our objective was to determine the most suitable temporal resolution for each algorithm.

The clustering performance of the three algorithms across the various sequence lengths is represented in Fig. 6. The K-Means clustering algorithm consistently demonstrated superior performance compared to Agglomerative Clustering and GMM: it achieved higher silhouette scores, a quality measure for clustering, for each sequence length. This consistency highlights the relative robustness of K-Means in dividing the video data into cohesive clusters. The K-Means algorithm reached its maximum silhouette score of 0.2888 when the sequence length was set to 5 frames, which suggests that shorter sequences may yield better clustering outcomes for this algorithm.

Figure 6

Graph depicting the various clustering techniques on the proposed dataset for varying sequence lengths.
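For reference, the silhouette score used to compare the clustering algorithms can be computed as follows. This is a plain-Python sketch of the standard definition with Euclidean distance, evaluated on toy points; it assumes at least two clusters, and the data and labels are hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # clusters match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters cut across it
```

Scores lie between -1 and 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.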

To the best of our knowledge, no previous effort has been made to categorise the poses of Shankha Prakshalana yoga. We were therefore unable to locate a directly comparable work, and instead compared our approach with similar works. Based on the data presented in Table 4, the proposed work achieves the highest accuracy among the state-of-the-art works considered.

Table 4 Comparison with other State of the art works

Discussions

Supervised learning

The supervised learning models consistently achieve high accuracy scores, ranging from 0.9842 to 0.9966, which indicates stable performance across the various sequence lengths. Random Forest (RF) consistently outperforms k-Nearest Neighbours (k-NN) and Support Vector Machine (SVM), achieving the highest accuracy at every sequence length. RF’s advantage in classifying the video data persists even when the sequence length is altered, demonstrating its robustness and efficacy.

RF performs best among the models because it classifies pre-extracted features, typically structured keypoint data (e.g., XY coordinates of joints such as the shoulders, elbows, and knees) derived from a pre-trained Human Pose Estimation (HPE) model. Through this procedure, the complex vision task is effectively offloaded to the HPE model, which provides the RF with a simplified, low-dimensional, and geometrically rich feature vector that is highly relevant to pose classification. These findings highlight the value of supervised learning algorithms, particularly RF, for tasks in which labelled data are available and high accuracy is required.

However, it is crucial to acknowledge the limitations stemming from the current dataset size. The dataset comprises only 77 RGB videos, which, while sufficient for initial validation, is relatively small and lacks high variability in backgrounds and subject demographics.

Unsupervised learning

The silhouette scores of the unsupervised learning algorithms (Agglomerative Clustering, GMM, and K-Means clustering) vary with the number of clusters and the sequence length. In the majority of configurations, K-Means achieves higher silhouette scores than Agglomerative Clustering and GMM, which implies that K-Means partitions the data into discrete clusters more effectively, largely irrespective of the number of clusters or the sequence length. Silhouette scores and classification accuracy measure different things and are not directly comparable; nevertheless, the low silhouette scores (at most 0.2888) indicate that the poses do not form well-separated clusters without labels, whereas the supervised algorithms classify the same data with high accuracy.

The high accuracy of this yoga pose detection system makes it useful in three key practical fields. Tele-Yoga/Remote Instruction allows for the delivery of real-time, automatic feedback on student alignment, ensuring the quality and safety of virtual classes. For Physical Rehabilitation, the model acts as an objective monitoring tool to verify that patients maintain proper posture throughout therapeutic exercises, hence reducing injury risk and increasing treatment success. Finally, Advanced Fitness Monitoring extends beyond basic activity tracking to deliver quantitative, granular feedback on form and technique, raising the bar for personalised digital fitness coaching. Collectively, these applications demonstrate the system’s potential to develop from a scientific model into an influential, adaptable health and wellness tool.

Real-time feasibility is an important part of a yoga pose identification system’s real-world implementation. Our chosen pipeline, which uses pre-extracted, low-dimensional keypoint data classified by a Random Forest (RF) model, is well suited to this purpose. Unlike complex deep learning architectures (such as 3D CNNs or large Transformers), which require high computing resources, the RF classifier offers fast inference and a small memory footprint, making it a good candidate for deployment on edge devices.

Conclusion and future work

The present study presents a new pipeline for the recognition of Shankha Prakshalana Kriya. Based on the results and analysis, the comparison demonstrates that supervised learning outperforms unsupervised learning in the domain of video data classification. Supervised learning algorithms, such as Random Forest, consistently demonstrate their effectiveness in handling labelled video data and producing accurate classifications, as evidenced by their high accuracy scores. In contrast, unsupervised learning algorithms such as K-Means clustering exhibit limited effectiveness in clustering the video data, as evidenced by their silhouette scores. The results highlight the significance of utilising pose-extracted data and supervised learning techniques for precise yoga pose classification, particularly in fields like video analysis where achieving high accuracy is crucial. Moreover, a promising direction is the development of user-friendly applications that utilise the model for personalised yoga training, which can adapt to individual skill levels and progress.

Future research will focus on extending the robustness and generalizability of the proposed pipeline. The current high performance, while promising, is measured against a single-subject or low-diversity dataset. To reduce the danger of overfitting and poor generalisation in real-world settings, a main goal will be to collect and integrate a bigger, more diverse multi-person yoga dataset. The existing reliance on 2D keypoints is intrinsically constrained by perspective, which can make geometrically distinct stances appear identical. As a result, incorporating 3D skeleton features is an important area for expansion. Improved Human Pose Estimation (HPE) models or depth-sensing cameras will be used to record the joints’ true three-dimensional coordinates.