Abstract
Musculoskeletal disorders significantly challenge the performance of many daily life activities, thus impacting the quality of life. The efficiency of traditional physical therapy programs is limited by ecological parameters such as intervention duration and frequency, number of caregivers and geographic accessibility, as well as by subjective factors such as the patient’s motivation and perseverance in training. The implementation of VR rehabilitation systems may address these limitations, but the technology still needs to be improved and clinically validated. Furthermore, current applications generally lack flexibility and personalization. A VR rehabilitation game simulation is developed, which focuses on the upper-limb movement of reaching, an essential movement involved in numerous daily life activities. Its novelty consists in the integration of a machine learning algorithm, enabling highly adaptive and patient-customized therapeutic intervention. An immersive VR system for the rehabilitation of the reaching movement using a bubble popping game is proposed. In the virtual space, the patient is presented with bubbles appearing at different locations and is asked to reach each bubble with the injured limb and pop it. The implementation of a Q-learning algorithm enables the game to adjust the location of the next bubble according to the performance of the patient, represented by his kinematic characteristics. Two test cases simulate the performance of the patient during a training program of 10 days/sessions, in order to validate the effectiveness of the algorithm, demonstrated by the spatial and temporal distribution of the bubbles in each evolving scenario. The results show that the algorithm learns the patient’s capabilities and successfully adapts to them, following the reward policy dictated by the therapist; moreover, the algorithm is highly responsive to variations in the kinematic features, while demanding a reasonable number of iterations.
A novel approach for upper limb rehabilitation is presented, making use of immersive VR and reinforcement learning. The simulation suggests that the algorithm offers adaptive capabilities and high flexibility, needed in the comprehensive personalization of a rehabilitation process. Future work will demonstrate the concept in clinical trials.
Introduction
Musculoskeletal disorders and diseases have a debilitating impact on the performance of daily life activities (ADL), negatively affecting the quality of life of more than 50% of individuals over 50 years old suffering from chronic conditions in developed countries1. Traditional rehabilitation of motor capabilities and physical therapy involve the repetition of task-oriented movements according to personal disability, in one-on-one therapist-patient sessions. The rehabilitation program is generally customized to the unique needs of the individual patient and includes personal condition assessment, definition of goals and of the strategy to attain them, a limited training period, pain management and measures of improvement2,3. The process is time-consuming, expensive, and subject to practical and ecological limitations concerning the number of caregivers, therapy session duration, intervention frequency and geographic accessibility. Challenges in traditional physical therapy also include the patient’s adherence to long-term or intensive rehabilitation regimens, which is a critical factor in achieving actual improvement and recovery in many impairment conditions4. The integration of assistive VR technologies addresses the disadvantages of conventional rehabilitation, proposing an alternative healthcare intervention system5.
Virtual Reality (VR) immerses the patient in a realistic or imaginary environment while stimulating the performance of cognitive and motor functions required in daily life. It elicits the user to interact with a virtual environment, generally in an engaging and playful way, thus encouraging repetitive and otherwise boring movements. Moreover, a virtual platform is highly flexible, as it enables home-training, adaptivity in the implementation of on-demand modifications, immediate feedback, therapist monitoring and maximal treatment customization3,6,7.
The use and benefits of VR in physical rehabilitation have been investigated for several pathological conditions5, including post-stroke injuries8,9,10,11, cerebral palsy12, trauma-related impairment13, multiple sclerosis14, Parkinson’s disease15, chronic joint pain and arthritis16, and more, showing positive rehabilitation outcomes.
This research proposes a VR rehabilitation program simulation, which focuses on the upper-limb movement of reaching for an object while standing, an essential movement involved in numerous daily life activities. Its novelty consists in the implementation of a machine learning algorithm, enabling highly adaptive and patient-customized therapeutic intervention.
Current VR simulation systems are used in physical treatment and evaluation, but very few of them integrate machine learning; the majority of current VR solutions involve optimization algorithms and conditional statements to adapt the difficulty level of rehabilitation “serious games” (games that serve a didactic purpose) according to performance17,18,19,20 and/or to supply feedback to the patient21,22. Chen et al.23 developed a virtual reality interactive game called Super Pop VR for children with cerebral palsy, where the player is supposed to pop as many bubbles as possible in a given time by moving his arms. The platform evaluates reaching-movement kinematic parameters in real time, such as movement time, path length, shoulder range of motion, elbow range of motion, and more. Game settings can be adjusted, such as bubble location, size and shape, and appearance and retention times. Similar adaptive systems allow the assessment of upper limb functionality in people with tetraplegia24 or recovering from stroke25, with immersive26 or non-immersive platforms27. To quantify performance, kinematic variables are generally recorded by position/motion sensors and analysed, and the challenge level can be configured accordingly28.
Nevertheless, the configuration is generally set a priori according to periodic reports, and a fully customized and adaptive system with performance-learning capabilities is still lacking.
Machine learning algorithms offer the possibility to upgrade VR rehabilitation serious games by adding higher degrees of interactivity and adaptivity; an artificially intelligent agent can continuously adapt the complexity of the task according to the reinforcement it receives from the patient’s performance and the therapist’s requirements. Recent works have proposed the use of reinforcement learning in the fields of computer-assisted cognitive training29,30 and assistive robotics29,31. The implementation of machine learning in VR upper-limb rehabilitation games focuses on diagnosis optimization, transfer of knowledge and monitoring progress throughout the rehabilitation process32,33. A few pioneering pilot studies make use of neural networks34 and the Dyna-Q reinforcement learning algorithm35 to give real-time feedback and modify game difficulty on-line. Barzilay and Wolf34 measure kinematics and muscle activity in a specific task to train a network, which can then generate new tasks for the individual trainee, based on his expected performance. Their immersive (HMD) system is designed for subjects with neuromotor disorders; the patient is asked to follow a planar trajectory with his fingertip, while the adaptive system collects information from biometric equipment, produces an inverse model estimation of the measured performance and teaches, by mirroring, a more appropriate trajectory.
Tsiakas et al.35 use a desktop tele-rehabilitation system with multisensory data collection (including pain-related facial expressions and speech) to give feedback and modify the difficulty level of three exercises by implementing a Dyna-Q reinforcement learning (RL) algorithm. In contrast to the model-free Q-learning methodology employed in our work, the Dyna-Q RL algorithm combines the model-free approach with offline simulation steps, aimed at requiring fewer real interactions at the cost of allowing some inaccurate actions.
Sekhavat36 makes use of Multiple-Periodic Reinforcement Learning for difficulty adjustment in separate periods; the desktop game consists of hitting the brightest ball among an arch of balls and measures the user’s performance (win/lose) with Kinect controllers. The algorithm adapts game properties such as speed, size of balls and distance between arches of balls by means of an elaborate RL-based algorithm, including multiple periods and multiple states, as well as probabilistic two-level actions; the goal of the adaptation is to optimize the user’s satisfaction by accommodating a win/lose balance, without considering the kinematic characteristics of the patient’s movement or the therapist’s rehabilitation recommendations.
It is important to note that all current works are feasibility studies, and their small number is still not enough to validate the feasibility and benefits of VR adaptive rehabilitation systems37; there is an imperative need for methods, simulations and experiments in the field of adaptive, customized rehabilitation.
The current work proposes a novel immersive VR adaptive simulation approach that tailors the VR space to the patient and adapts a rehabilitation game in real time, implementing a Q-learning algorithm and considering both the patient’s biomechanical kinematics and the therapeutic strategy. Compared to similar RL algorithm-based works35,36, this proof-of-concept simulation offers an adaptive, model-free machine learning approach for the rehabilitation of the upper limb in ADL tasks; the adaptation is based on real performance only, and allows the customization of kinematic parameters and therapeutic recommendations.
Methods
Game description
An immersive VR rehabilitation system for reaching movement using a bubble popping game is proposed. The game, implemented using Unity 3D Game Engine (Unity Technologies, v2021.1.9.f1), is conceived as a rehabilitation solution for all upper-limb disabilities, regardless of the pathology or injury involved. The system includes an Oculus Quest 2 (Meta Quest 2) Head-Mounted Display (HMD) with 4 embedded infrared cameras (Fig. 1). A customized application is employed to collect and store the controller’s position and orientation. This study reports the results of a proof-of-concept simulation, hence the Meta Quest 2 device is not actually used in a real setup.
In the virtual space, the patient is presented with bubbles appearing at different locations and he is asked to reach the bubble with his injured limb and to pop it by touch. The movement involves mainly the shoulder and elbow joints.
The game's space is divided into a 3D set of cubes (not visible to the player). At each step of the game, a bubble appears at a selected cube. The patient needs to physically reach and touch the bubble to explode it. If he doesn’t succeed, the bubble disappears after a fixed time, determined by the therapist’s setup. When a bubble is popped, it instantly disappears making a pop sound. In the virtual space the patient can see the elapsed time and his score, based on popping success/failure and kinematic measures. At the following step, a new bubble appears at another cube of the game space. Figure 1 shows the virtual space as it presents a bubble to the user. The position of the patient’s hand is represented by a hand in the virtual space. The dimensions and boundaries of the virtual space are defined according to the limits of reaching movements of a person without disabilities.
The use of the Reinforcement Learning approach and specifically Q-learning algorithm enables the customization of the rehabilitation serious game. The system allows personalized allocation of the bubbles as a function of the user’s changing physical capabilities; the game adjusts the location of the next bubble according to the previous performance of the patient. The individual user abilities represent the environment that the AI agent is to learn. The agent decides where to present the next bubble in order to get a maximum reward. The reward function models the policy of a personalized treatment, determined by therapist’s instructions. For example, it may give a high reward for bubble locations that are challenging for the patient; the algorithm will then learn the patient’s abilities and accordingly prefer to present bubbles in areas where the reward is high. This approach provides a dynamic personalization and adaptation of the VR rehabilitation process.
Reinforcement Q-learning
Reinforcement learning is an area of machine learning that models how software agents should take actions in an environment in order to maximize a reward38: the selection of an action is achieved by a learning agent that interacts with the environment and tries to maximize a cumulative reward. The selected action may affect not only the immediate reward but also the next state of the environment and subsequent rewards. To obtain high rewards, a reinforcement learning agent will prefer an action that it has tried in the past and found to be effective in producing a reward. To make better action selections in the future, it also has to try actions that it has not selected before. At each time step t, the agent receives a representation of the environment’s state \({S}_{t}\) and selects an action \({A}_{t}\). At the following time step, according to the selected action, the agent receives a reward \({R}_{t+1}\) and the environment’s state is \({S}_{t+1}\).
Reinforcement learning systems follow one of two decision-making approaches. The model-based method involves the use of a predictive world model, asking questions like “what will happen if I do A?” to select the best action (as in a Markov Decision Process). Conversely, the model-free approach skips the modelling step and directly learns a control policy.
Model-free reinforcement learning offers significant advantages in rehabilitation due to its adaptability, flexibility, and real-time feedback capabilities. Unlike model-based methods that require precise models of the environment or patient conditions, model-free approaches can adjust to varying patient responses and changing conditions without extensive retraining. This adaptability is crucial in rehabilitation, where progress can be nonlinear and unpredictable.
This work makes use of the Q-learning algorithm, a popular reinforcement learning method based on model-free exploration and iterative learning of the environment. Q-learning is a natural candidate for model-free RL, as it allows modelling the 3D space by states, actions and the transitions between them. One key advantage of Q-learning is that it is off-policy, as compared to on-policy methods such as Sarsa, Monte Carlo and temporal-difference algorithms. Off-policy algorithms allow suboptimal decisions, therefore balancing exploration and exploitation effectively, which is suitable for dynamic rehabilitation settings.
The goal of Q-learning is to learn a policy that maximizes rewards. This is achieved by creating a table (Q-table) of scores (Q-values) for all possible scenarios. The table has rows representing states {S} and columns representing possible actions {A}. The Q-values are stored in the Q-table and are updated during training of the algorithm by the Q-function, which generates new optimal Q-values based on both present and expected future information. After a first initialization, the Q-table is continuously updated as the algorithm proceeds with its learning process, which ends as soon as the optimal expected value of the total reward is found.
After selecting an action during learning, the Q-value for a given state and action is replaced by a new value, evaluated as follows39:

$${Q}^{new}\left({S}_{t},{A}_{t}\right)=\left(1-\alpha \right)\,Q\left({S}_{t},{A}_{t}\right)+\alpha \left({R}_{t}+\gamma \,max\left(Q\left({S}_{t+1},A\right)\right)\right)$$(1)

The algorithm replaces the current Q-value, \(Q\left( {S_{t} ,A_{t} } \right)\), with a new one, \({Q}^{new}\left({S}_{t},{A}_{t}\right)\), for the state \({S}_{t}\) and the action \({A}_{t}\), by calculating a weighted average of old and new information, where
- \(\alpha\) is the learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is considered. A factor of 0 means the agent does not learn at all, and a factor of 1 makes the agent consider only the most recent information.
- \({R}_{t}\) is the reward for taking action \({A}_{t}\) at state \({S}_{t}\).
- \(max\left(Q\left({S}_{t+1},A\right)\right)\) is the maximal expected future reward given the new state \({S}_{t+1}\) and all possible actions \(A\) at the new state.
- \(\gamma\) is the discount rate that determines the present value and the importance given to future rewards. It is a value between 0 and 1: as γ approaches 0 the agent is concerned only with maximizing immediate rewards, whereas a γ value of 1 leads the agent to consider only long-term rewards.
The new Q-value then includes two parts weighted by the learning rate: the old value and the new learned value, which is the sum of the immediate reward and a discounted estimate of the optimal future value.
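As a minimal sketch, the weighted-average update described above can be written as follows; a Python dictionary stands in for the Q-table, and the function name and data layout are illustrative, not the authors' implementation:

```python
# Sketch of the Q-learning update: replace Q(S_t, A_t) with a weighted
# average of the old value and the newly learned value (the immediate
# reward plus a discounted estimate of the optimal future value).
def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.4, gamma=1.0):
    best_future = max(q_table[(next_state, a)] for a in actions)
    old = q_table[(state, action)]
    q_table[(state, action)] = (1 - alpha) * old + alpha * (reward + gamma * best_future)
    return q_table[(state, action)]
```

With alpha = 0.4 and all values initialized to zero, a single reward of 10 moves the Q-value to 4.0, i.e., 40% of the new information is absorbed in one round.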
For the algorithm to explore new actions, other than the ones with maximum reward, an exploration rate \(0<\varepsilon <1\) is adopted, such that the algorithm selects a random action a pre-determined percentage of the times:

$${A}_{t}=\left\{\begin{array}{ll}\text{a random action}& \text{with probability }\varepsilon \\ \underset{A}{argmax}\,Q\left({S}_{t},A\right)& \text{with probability }1-\varepsilon \end{array}\right.$$(2)

The action \({A}_{t}\) is chosen either randomly, with a probability of \(\varepsilon\), or as the best-scored action in the Q-table, with a probability of \(\left(1-\varepsilon \right)\). This approach is called epsilon-greedy Q-learning40 and provides a balance between exploitation and exploration by random choice.
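The epsilon-greedy selection just described can be sketched as follows (an illustrative helper, where `q_table` maps (state, action) pairs to Q-values):

```python
import random

# Epsilon-greedy selection: a random action with probability eps,
# otherwise the best-scored action in the Q-table for the current state.
def choose_action(q_table, state, actions, eps, rng=random):
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])
```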
Implementation of the Q-learning algorithm
In the game, the patient is presented with bubbles in a 3D space which he must reach and pop. At each round a bubble appears at a different location. To translate the problem of where to present the bubbles into a reinforcement learning problem, the following representation is used: the player represents the environment, and the state is the location of the bubble. The goal of the agent is to present bubbles in locations that will result in maximal reward.
Each cube in the space is given an initial reward value of zero. During the game, the algorithm learns the patient’s physical reaching abilities and changes the reward value accordingly (Eq. 1). As the actual ability of a patient is probably not symmetrical in all directions, different bubble locations will result in different rewards.
The agent learns to choose the next bubble location (the cube location) that maximizes the total reward value.
The reinforcement problem is modelled as follows:
- State (S)—{XYZ | X, Y, and Z describe the row, column, and depth of a cube in the space, respectively}
- Action (A)—{Up, Down, Right, Left, Forward, Backward}
- Q-table—a 2D table that gives the maximum expected future reward for each state and action: every cell contains the value of the Q-function Q(S, A), i.e., the Q-value of every possible action for each given state. The rows and the columns of the table represent the states and the possible actions from each state, respectively. For each state, all the columns representing actions that lead to states within the game boundaries are initialized to zero; otherwise the values are initialized to − 100 (for example, from the top row the action Up is initialized to − 100).
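This initialization can be sketched for the 8 × 8 × 4 space defined later in the simulation section; the dictionary layout, names and axis conventions are illustrative assumptions:

```python
# Initialization sketch for the Q-table: actions that stay inside the
# game boundaries start at 0, actions that would leave the space start
# at -100 (e.g. Up from the top row).
MOVES = {"Up": (0, 1, 0), "Down": (0, -1, 0), "Right": (1, 0, 0),
         "Left": (-1, 0, 0), "Forward": (0, 0, 1), "Backward": (0, 0, -1)}
DIMS = (8, 8, 4)  # X (right-left), Y (up-down), Z (depth)

def init_q_table(dims=DIMS):
    q = {}
    for x in range(dims[0]):
        for y in range(dims[1]):
            for z in range(dims[2]):
                for action, (dx, dy, dz) in MOVES.items():
                    nx, ny, nz = x + dx, y + dy, z + dz
                    inside = (0 <= nx < dims[0] and 0 <= ny < dims[1]
                              and 0 <= nz < dims[2])
                    q[((x, y, z), action)] = 0.0 if inside else -100.0
    return q
```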
At each step, the bubble appears at a certain location; this is the state of the environment. According to the performance of the patient at that stage, the Q-table is updated and the best action to take from that state is selected. According to the selected action, the next bubble location is determined.
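One possible composition of this step is sketched below: update the Q-table from the patient's performance at the current bubble, pick the next action epsilon-greedily, and move the bubble accordingly. The exact ordering in the authors' implementation may differ; the names and parameter values are illustrative:

```python
import random

# One game step (sketch): Q-update, epsilon-greedy choice of the next
# action, and transition of the bubble to the neighbouring cube.
MOVES = {"Up": (0, 1, 0), "Down": (0, -1, 0), "Right": (1, 0, 0),
         "Left": (-1, 0, 0), "Forward": (0, 0, 1), "Backward": (0, 0, -1)}

def game_step(q, state, action, reward, eps, alpha=0.4, gamma=1.0, rng=random):
    # The selected action moves the bubble one cube in the chosen direction.
    next_state = tuple(c + d for c, d in zip(state, MOVES[action]))
    best_future = max(q[(next_state, a)] for a in MOVES)
    q[(state, action)] = ((1 - alpha) * q[(state, action)]
                          + alpha * (reward + gamma * best_future))
    if rng.random() < eps:
        next_action = rng.choice(list(MOVES))
    else:
        next_action = max(MOVES, key=lambda a: q[(next_state, a)])
    return next_state, next_action
```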
The therapist can define a training strategy that can be translated into the reward function. The reward links the presentation of bubbles to the kinematic performance of the patient. There are two aspects to take into consideration: does the patient manage to pop the bubble, and what are the kinematic characteristics of his movement towards the bubble?
Performance is thus represented by three cases:
(a) The patient doesn’t pop the bubble,
(b) The patient easily pops the bubble,
(c) The patient pops the bubble with difficulty—the movement is slow, unsmooth, etc.
The training strategy is determined by the therapist, bearing in mind that the higher the reward in a location, the larger the number of bubbles presented there.
For the sake of the present simulation the following strategy is adopted: the therapist wishes to encourage practice at the limits of the patient’s range of motion (ROM), where there is difficulty in movement, to display a small number of bubbles where movement is easy, and to display a few bubbles where they are still unreachable:
(a) When the patient does not reach/pop the bubble, the given reward value is low: fewer bubbles are then presented in the relevant zone, so that the patient will have the opportunity to try again without being frustrated or physically injured.
(b) When the patient pops the bubble “easily”, the reward is given a low negative value: this is not the zone to be trained, and there is less interest in presenting bubbles there.
(c) When the patient reaches the bubble “with difficulty”, the reward is high, so that more bubbles are displayed in this encouraged practice zone.
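The three cases above can be sketched as a reward function. The actual numeric values of Eq. (4) are set by the therapist and are not reproduced here, so the numbers and the K-based "easy" threshold below are illustrative assumptions only:

```python
# Illustrative reward function for the three performance cases; the
# values and the easy_threshold are assumptions, not those of Eq. (4).
def reward(popped, k_score, easy_threshold=0.8):
    if not popped:
        return 1.0    # (a) bubble not reached: low reward
    if k_score >= easy_threshold:
        return -1.0   # (b) popped easily: low negative reward
    return 10.0       # (c) popped with difficulty: high reward
```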
The patient’s difficulty is represented by the kinematic characteristics of his movement; these may address path length (m1), smoothness (m2), time duration from bubble appearance to pop (m3) and maximal speed (m4). In future experimental work, these parameters can be calculated from the data collected by the VR tracking system. In real game conditions, a VR headset with hand controllers and built-in cameras and sensors can supply hand tracking and identify the location of the body and hands in the virtual space in relation to the headset.
The VR system can track movements without any additional external sensor, thus allowing to measure hand position and to evaluate the kinematic parameters m1–m4. In the present simulation, they are given normalized values ranging from 0 to 1, and a kinematics score K is calculated by averaging the kinematic characteristics:

$$K=\frac{{m}_{1}+{m}_{2}+{m}_{3}+{m}_{4}}{4}$$(3)
In real trials a different weight may be determined for each kinematic parameter, taking into consideration the specific characteristics of the pathology involved. The kinematic parameters are normalized in order to reflect adequately the difficulty in performing the reaching task. For example, a longer pathlength could erroneously represent a distant bubble, instead of depicting a longer path that the patient performs because of his physical disability. Therefore, normalized pathlength m1 takes into account the shortest pathlength to the bubble, so that a normalized longer distance represents an actual longer path to the bubble. Similarly, normalization of path duration, m3, should include information of an average time to target based on an average movement velocity, measured in current or former states.
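The kinematics score of Eq. (3) and the path-length normalization can be sketched as follows; taking the ratio of the shortest path to the actual path (so that a value near 1 means an efficient movement) is one plausible reading of the text, not the authors' exact formula:

```python
# Sketch of the normalized path-length parameter m1 and of the
# kinematics score K (the average of the normalized parameters m1-m4).
def normalized_pathlength(shortest, actual):
    # Ratio close to 1: near-optimal path; close to 0: long detour.
    return min(shortest / actual, 1.0) if actual > 0 else 0.0

def kinematics_score(m1, m2, m3, m4):
    return (m1 + m2 + m3 + m4) / 4.0
```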
The reward function \(R_{t}\) for the conditions and adopted strategy cited above is calculated as follows:
It is important to note that the patient’s performance may change throughout the treatment. Some abilities may improve (due to practice and recovery) while others might deteriorate. Accordingly, the reward given when reaching a certain bubble may change. The Q-table is updated at each iteration according to the pseudo-code in Table 1, and thus reflects changes in patient’s performance.
Parameters of the algorithm
The use of the algorithm requires the selection of the parameters in Eq. (1):
- \(\alpha\), the learning rate, is defined as 0.4. This value balances previous with current performance, allowing the agent to learn without being too influenced by a single round.
- \(\gamma\), the discount factor, is defined as 1, to allow maximal influence of future rewards. This encourages the display of bubbles at locations with lower reward that may lead to locations with higher reward.
- \(\varepsilon\), the exploration rate, defines the rate of random choice in the evaluation of the new Q-value, \({Q}^{new}\left({S}_{t},{A}_{t}\right)\), in Eq. (1). An epsilon-greedy approach balances random exploration and exploitation of choices with maximal reward. At the beginning of the learning process the agent has no information about the capabilities of the user, so random action selection allows the algorithm to explore the environment and to discover new areas where there are motor difficulties. As the training progresses, we want to rely more on what the algorithm has learnt about the patient. We therefore use a dynamically decreasing epsilon-greedy approach, as proposed by Wang et al.41 and Even-Dar and Mansour42. The following equation is used in this study:

$$\varepsilon =\max \left( \min \left( \frac{E}{Total\,actions \times Total\,reward},\;0.5 \right),\;0.3 \right),\quad with\;E = 0.9$$(5)

Accordingly, the exploration rate in the first sessions is high, i.e. \(\varepsilon =0.5\), and lowers to \(\varepsilon =0.3\) in later sessions.
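Eq. (5) can be sketched directly: epsilon is clipped between 0.3 and 0.5 and shrinks as the accumulated reward grows, so early sessions explore more and later sessions exploit what has been learnt:

```python
# Decreasing exploration rate of Eq. (5), clipped to the range [0.3, 0.5].
def exploration_rate(total_actions, total_reward, e=0.9):
    return max(min(e / (total_actions * total_reward), 0.5), 0.3)
```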
- Episodes—the number of VR sessions in this simulation is defined as 10 (one session per day).
- Iterations—the number of bubbles presented for the patient to pop each day is 400. A comparative study also considers iterations = 50, 400, 1000.
The simulation
In order to validate the algorithm, a simulation is presented in this paper as a preliminary stage before conducting clinical trials. The simulation flowchart is shown in Fig. 2: as soon as a bubble is displayed, kinematic characteristics featuring a test case are collected from a data file. The reward value is then calculated according to Eq. (4) and the Q-learning algorithm is applied to select the location of the next bubble as shown in Table 2.
The VR simulation space is defined as a three-dimensional rectangular box, divided into 8 × 8 × 4 units: 8 cells in the X and Y axes, describing horizontal (right-left) and vertical (up-down) movements, respectively, and 4 cells in the Z axis, the movement’s depth (close-far).
The dimensions and the boundaries of the space are customized to each specific patient according to his height: anthropometric measurements make it possible to relate an individual’s height to his arm length43. The patient’s shoulder joint is located at (X,Y,Z) = (4,4,0), as shown in Fig. 3.
The space is divided into three areas: the target area (zone 1), the zone around the target (zone 2) and the remaining space (zone 3). These zones represent locations of varying performance: the patient reaches differently in every zone (with varying degrees of difficulty/ease), thus presenting kinematic characteristics appropriate to each zone and scenario. A matrix of kinematic characteristics (m1, m2, m3, m4) changing with time and location simulates the patient’s performance and is used to calculate the reward.
The simulation addresses the everyday activity of reaching objects placed on a shelf in the right-upper corner of the patient’s kitchen/bathroom/closet. The patient has difficulties in reaching the shelf in the right upper corner of the VR space, depicted as zone 1 in red in Fig. 3.
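A zone-classification sketch for the 8 × 8 × 4 space follows. The exact cube sets are not given in the text, so the geometry here is an assumption chosen only to match the stated zone sizes: zone 1 (4 cubes) is taken as a 2 × 2 × 1 block in the right-upper far corner, zone 2 (12 cubes) as the rest of the surrounding 4 × 4 × 1 block, and zone 3 as the remaining space (240 cubes):

```python
# Assumed zone partition of the 8 x 8 x 4 space (sizes 4 / 12 / 240 match
# the paper; the exact cube coordinates are illustrative).
DIMS = (8, 8, 4)

def zone(x, y, z):
    if x >= 6 and y >= 6 and z == 3:
        return 1  # target area in the right-upper far corner
    if x >= 4 and y >= 4 and z == 3:
        return 2  # area around the target
    return 3      # remaining space
```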
The simulation provides a treatment tailored to the dynamic physical state of the user, defining kinematic characteristics that are changing in time during an estimated 10 sessions/days program of VR therapy.
Two test cases are considered to assess the validity of the proposed Q-learning approach. To simplify the simulation and the understanding of its results, the course of the therapeutic program is divided into 3 periods of time with different kinematic performance: days 1–3, days 4–6 and days 7–10. In zones 1 and 2, which include relatively few spatial units (4 and 12, respectively), constant kinematic characteristics are adopted, whereas in the much larger zone 3 (240 units) a random variation between the user’s high and low performance accounts for more realistic spatial kinematic differences: good performance is represented by random K values between 0.8 and 0.9 (K is defined in Eq. (3)), and low kinematic scores are evaluated as 30% of the good performance values (random K values between 0.24 and 0.27).
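The simulated zone-3 kinematics described above can be sketched as a draw of K; the helper name and the use of a uniform distribution are assumptions consistent with the stated ranges:

```python
import random

# Simulated zone-3 kinematics score: good performance draws K uniformly
# in [0.8, 0.9]; low performance is 30% of that (i.e. [0.24, 0.27]).
def simulated_k(good, rng=random):
    k = rng.uniform(0.8, 0.9)
    return k if good else 0.3 * k
```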
Test case 1
John comes to VR rehabilitation after a stroke. He suffers from muscle weakness and stiffness of his right arm and is advised occupational therapy to regain functionality in his daily activities. In the first three sessions (days 1–3) he cannot reach the right upper corner of the space (zone 1), he manages to pop bubbles in zone 2 with difficulty, and he easily reaches the bubbles in zone 3. In zone 2, his movements are relatively stiff and slow, resulting in low kinematic scores. In the next 3 sessions (days 4–6) his condition improves: he succeeds in popping bubbles in the target zone 1, albeit with difficulty, the quality of his movements improves in zone 2, and he still easily reaches the bubbles in zone 3. In the last 4 VR sessions (days 7–10), John experiences fatigue that results in a minor slow-down in his recovery process. He pops bubbles in zone 1 with difficulty, his kinematic scores slightly decrease in zone 2, and he performs as well as usual in zone 3.
The kinematic parameters corresponding to test case 1 are shown in Table 2: normalized values of m1 (path length), m2 (trajectory smoothness), m3 (time duration from bubble appearance to pop) and m4 (maximal speed) are evaluated in the different stages of the case study.
Test case 2
Mary suffers from a frozen shoulder. The range of motion of her right arm is impaired and she struggles with dressing and with raising her hand to reach the right upper corner of her living space. In the first three sessions of her VR rehabilitation (days 1–3) she cannot reach target zone 1. She manages to pop bubbles in zone 2 with difficulty and pain, but she can reach bubbles in zone 3 with ease. However, the training results in an increase in the intensity of the pain that she experiences, so that in the following 3 days (days 4–6) neither zone 1 nor zone 2 can be reached, and in zone 3 bubbles are popped but movement is painful and difficult. In the last 4 sessions, Mary’s physical condition improves, and her kinematic scores are similar to those of days 1–3. The kinematic parameters corresponding to test case 2 are shown in Table 3.
Results
Test case 1
Table 4 summarizes the results for test case 1. These include the number of bubbles, the number of bubbles divided by zone size and the sum of the rewards in each zone and training day.
According to the adopted treatment strategy, in the first three days the reward is highest in zone 2, where John pops the bubbles with difficulty, whereas in zones 1 and 3 the rewards are low and negative, respectively. On day 1 the algorithm displays 43 bubbles in zone 1, 172 in zone 2 and 185 in zone 3. During the first three days the number of bubbles increases in zone 2 and decreases in zones 1 and 3, as expected. The number of bubbles is divided by zone size, in order to take into account the significant size difference between the zones: zone 1 includes 4 cubic units, zone 2 counts 12 units and zone 3 includes 240 units. Consequently, in zone 3 the number of bubbles divided by zone size is very low (< 1), showing very sparse bubble appearance in this large zone.
In the next three days (days 4–6), John’s performance improves in both zones 1 and 2: the number of bubbles in zone 1 immediately rises significantly, showing that the algorithm adapts quickly to the changing environment (the patient). The interest in zones 2 and 3 decreases, as John’s kinematic parameters there improve. In the last four days (days 7–10) John’s kinematic performance decreases in zone 2 from K = 0.775 (days 4–6) to K = 0.6. This change may seem minor but the algorithm responds with an initial increase in the number of bubbles in zone 2 (from 46 on day 6 to 67 on day 7) and a bubble number decrease in zone 1 (from 337 on day 6 to 317 on day 7), followed by a stabilization of the number of bubbles in all zones, with a consistent increase tendency in zone 1 and decrease in zones 2, 3. On day 10, 371 bubbles are presented in zone 1, 39 in zone 2 and 10 in zone 3, the bubble number divided by zone size is 87.75 for zone 1, 3.25 for zone 2 and 0.04 for zone 3.
Figure 4 shows graphically the number of bubbles appearing in the VR space on days 1, 4, 7, 10 for test case 1. The figure clearly shows a gradual increase in the number of bubbles in the upper right far corner of the space from day 1 to day 10.
In Fig. 5 the number of bubbles divided by zone size is plotted for each zone and day, for test case 1 (Fig. 5a) and test case 2 (Fig. 5b); the figure enables the visualization of bubble number variation in each zone over time. In test case 1 (Fig. 5a), the number of bubbles divided by zone size is so small in zone 3 that its variation is graphically unnoticeable. The dominant contributions in this test case come from zones 1 and 2; the number of bubbles divided by zone size shows opposite trends in these two zones, dictated by the reward policy and the case scenario. From day 4 both zones converge to their final values, with a significantly higher number of bubbles in zone 1, the target area. The adaptivity of the algorithm is demonstrated by its response to the changes on days 4 and 7.
Another indication of the algorithm’s learning process is the reward achieved. Table 4 shows the sum of rewards obtained in each zone and day. The sum of rewards in zone 2 is higher than in the other zones during the first three days, when John cannot pop bubbles in zone 1 and is encouraged to train in zone 2. Later, when John can pop bubbles in zone 1, the training focus moves to this zone, which then has the highest sum of rewards. These results are, of course, consistent with the variation in the number of bubbles. In Fig. 6 the total reward, i.e., the sum of rewards over all zones, is displayed for each day. The algorithm learns the patient’s abilities and therefore achieves higher rewards as the training progresses. The total reward converges as required with this number of iterations (400).
Test case 2
Table 5 summarizes the results of test case 2. In the first three days Mary cannot pop bubbles in zone 1, therefore the reward function defines a higher reward for zone 2 than for zone 1. In the following four days Mary’s condition worsens, she cannot reach zone 2 and has difficulties in reaching zone 3; in the last four days her abilities return to their initial values.
It is interesting to note that although John and Mary start with identical kinematic scores, on day 1 the number of bubbles displayed by the algorithm is not identical for the two cases in all zones. This is because the exploration rate parameter is high in the first days of training (\(\varepsilon =0.5\)), ensuring that the agent acts with a “healthy” degree of randomness. On the second and third days of learning, the algorithm converges quickly, displaying a similar number of bubbles in both cases.
As shown in Table 5, on the first and last days the largest number of bubbles is displayed in zone 2 (315 bubbles on day 3, 318 bubbles on day 10), and on days 4–6 in zone 3 (329 on day 4, 349 on day 6). This is consistent with the fact that Mary can pop bubbles in zone 2 at the beginning and the end of her rehabilitation, so the algorithm displays more bubbles in this zone. On days 4–6 her physical condition worsens and zone 2, together with zone 1, is not reachable; the algorithm adapts by presenting bubbles mainly in zone 3.
Figure 7 shows the number of bubbles displayed in the virtual space on days 1, 4, 7 and 10 for test case 2. The figure highlights the dispersion of the bubbles on days 1, 4 and 7; on the first day, the algorithm has little knowledge of the patient’s capabilities and spreads the bubbles over all zones. On day 4, Mary’s physical condition changes significantly, providing new information and dispersing the bubbles in zone 3. From day 7 to day 10 the algorithm focuses on zones 1 and 2, since Mary can again pop bubbles in these zones, so that on day 10 a larger number of bubbles is displayed in the right upper far corner of the space.
The same trends are observed in Fig. 5b, describing the number of bubbles divided by zone size in the different zones along the 10 days of training. In contrast to test case 1, where the bubbles/zone size ratio in zone 3 is extremely low (Fig. 5a), here the relative number of bubbles in zone 3 increases on days 4–6, following the case scenario. The relative number of bubbles in zones 1 and 2 decreases abruptly but remains higher than the relative bubble number in zone 3. It is reasonable to assume that if Mary’s low kinematic measures in zones 1 and 2 were to persist in subsequent sessions (after day 6), the relative number of bubbles in zone 3 would eventually exceed those in zones 1 and 2. In this case, the increase in bubbles/zone size in zone 2 is restored as soon as Mary’s physical condition improves back to its initial state.
The algorithm learns the patient’s abilities by calculating the reward and updating the Q-table at each iteration. As the bubbles cover a space of 8 × 8 × 4 = 256 locations, the algorithm needs many iterations to learn the patient’s abilities. We therefore compared the performance of the algorithm with 50, 400 and 1000 daily iterations. The results for 400 iterations are described in detail in the sections above. The comparative study is shown in Fig. 8, describing for each zone the normalized number of bubbles for different daily iteration values in test case 1. In order to compare the results for different iteration counts appropriately, the number of bubbles is normalized by both zone size and the number of iterations, and multiplied by the nominal number of iterations (400):

$$N_{norm}=\frac{N_{bubbles}}{\text{zone size}\cdot N_{iterations}}\cdot 400$$
The normalized number of bubbles is very similar for 400 and 1000 iterations in all zones. In addition, both trends show fast and similar convergence when adapting to the kinematic changes on days 4 and 7. The results for 50 iterations present major differences compared to those for 400 and 1000 iterations: both the initial values on day 1 and the convergence process on days 2–3, 5–6 and 8–10 differ significantly from the higher-iteration cases. With 50 iterations the algorithm does not converge to the expected bubble values in zones 2 and 3 on days 3 and 6, nor in zones 1 and 3 on day 6. On day 10, after four days of adjustment, the algorithm reaches reasonable values, although still quite far from those achieved with 400 and 1000 daily iterations. We may then conclude that the number of iterations is a pivotal factor in assessing the quality of the simulation.
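As an illustrative sketch, the normalization described above can be expressed in a few lines of Python; the function and argument names are ours, not taken from the original simulation code:

```python
# Nominal daily iteration count used as the reference scale (from the text).
NOMINAL_ITERATIONS = 400

def normalized_bubbles(bubble_count, zone_size, daily_iterations):
    """Bubble count divided by zone size and iteration count,
    scaled by the nominal iteration count (400) so that runs with
    50, 400 and 1000 daily iterations are directly comparable."""
    return bubble_count / (zone_size * daily_iterations) * NOMINAL_ITERATIONS

# Example with the day-1 zone-1 values from Table 4 (43 bubbles, 4 cubic
# units, 400 iterations): the result equals the plain bubbles/zone-size
# ratio, since the run uses the nominal iteration count.
print(normalized_bubbles(43, 4, 400))  # 10.75
```

With 1000 daily iterations the same raw count would be scaled down by a factor of 400/1000, which is what makes the curves in Fig. 8 comparable.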
Discussion
Upper extremity function plays an important role in one’s ability to perform activities of daily living (ADL), and the loss of upper-limb functionality is a predictor of quality of life. Therefore, it is essential to implement treatments aimed at its recovery.
This project focuses on the rehabilitation of reaching movement which is involved in many ADL. Reaching involves mainly the use of the proximal joint functions (shoulder and elbow). The wrist, elbow and shoulder joints work together to take the hand through space towards various targets. As every patient has specific disabilities evolving with time and training, different practice strategies are needed. An approach for a personalized and adaptive rehabilitation VR game is proposed. In this proof-of-concept study the simulation of two test cases is performed to answer the following questions.
Does the algorithm spread the bubbles according to the treatment strategy and patient’s motor abilities?
The validity of the approach, integrating a Q-learning algorithm, is explored by implementing it in two simulated test cases presenting different kinematic characteristics. The results shown in Figs. 4, 6 and Tables 3, 4 demonstrate that the number of bubbles adapts to each zone and case, illustrating the strength of the suggested algorithm. Bubbles are displayed in each zone according to the reward function expected for the dynamics of each case.
The total kinematic score in each case is defined in the algorithm as an average of four kinematic scores representing features such as smoothness and maximal speed, previously described in movement control strategies44,45. Future research may address other or additional kinematic features, as well as different formulas for the total kinematic score. This formula may thus be adjusted according to the therapist’s rehabilitation protocol; for example, the kinematic characteristics may be multiplied by weight factors. Such changes are easy to implement thanks to this simple yet efficient algorithm.
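A possible sketch of such a weighted total kinematic score, assuming four features m1–m4 and hypothetical equal weights (the authors' exact formula, Eq. (3), and any protocol-specific weights may differ):

```python
def total_kinematic_score(m, w=(0.25, 0.25, 0.25, 0.25)):
    """Weighted average of four kinematic feature scores m1-m4
    (e.g. smoothness, maximal speed). The weights w are hypothetical
    placeholders for values a therapist's protocol would set."""
    assert len(m) == len(w) == 4
    return sum(wi * mi for wi, mi in zip(w, m)) / sum(w)

# Example with illustrative feature scores in [0, 1]:
print(total_kinematic_score([0.8, 0.7, 0.75, 0.85]))  # ~0.775
```

Changing a weight (say, emphasizing smoothness) immediately shifts the total score, and thereby the reward the algorithm sees, without touching the learning loop itself.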
Does the algorithm adapt to changes in patient’s performance throughout the treatment?
In a rehabilitation process, the program is generally customized to the unique needs of the individual patient, defining specific objectives and a timetable. The implementation of the rehabilitation plan must respond to changes in the rate of progress and to possible medical changes; thus, adjustments according to the patient's condition are required2,3. The course of the simulated therapeutic program is divided into three periods, with different constant kinematic performance characteristics in each zone. In both simulated cases the kinematic characteristics are set to represent a decrease, an increase or no change in value relative to the previous period. The results show that the algorithm adapts to these kinematic changes; moreover, it is highly responsive to variation in the kinematic features. The adaptation is achieved immediately after the change, resulting in significant fluctuations in bubble number and distribution already within the first day of change (days 1, 4 and 7). In most real scenarios, the patient’s kinematic characteristics are expected to change more moderately, so that the algorithm’s high responsiveness will result in faster convergence to the adequate bubble dispersion.
How many daily iterations are needed for the algorithm to learn the patients’ abilities?
For the reinforcement learning algorithm to learn the environment, it must explore the different states. In the proposed approach, the patient’s motor abilities are modelled as the environment. For this to be feasible, the algorithm must perform a sufficient number of iterations to explore the different possible states, while ensuring that the patient is able to complete the training.
In the test cases simulated here, the Q-table contains 8 × 8 × 4 = 256 rows (all possible bubble locations) and 4 columns (the possible actions). The resulting number of exploration possibilities follows directly and should be carefully considered when determining the number of iterations.
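A minimal sketch of the tabular Q-learning update on a table of this shape (256 states by 4 actions, as stated above). The learning rate and discount factor values are hypothetical, not taken from the paper:

```python
import numpy as np

# Q-table: 8 x 8 x 4 = 256 bubble locations (rows) x 4 actions (columns).
n_states, n_actions = 8 * 8 * 4, 4
Q = np.zeros((n_states, n_actions))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning step: move Q[s, a] toward the target
    r + gamma * max_a' Q[s_next, a']. alpha and gamma are illustrative."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# One illustrative step: reward 5.0 for action 1 in state 0, moving to state 10.
q_update(Q, s=0, a=1, r=5.0, s_next=10)
print(Q[0, 1])  # 0.5
```

Each popped (or missed) bubble yields one such update, which is why the number of daily iterations directly bounds how much of the 256-state space the agent can explore.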
Bubble dispersion as a function of session number (day) is compared in test case 1 for different numbers of daily iterations (see Fig. 8); for 400 and 1000 iterations, the algorithm’s convergence rate, representing its ability to respond to spatial and temporal changes in the user’s performance, is significantly higher than for 50 iterations. The higher the number of iterations, the higher the convergence rate, and thus the more adaptive and responsive the algorithm. However, the number of iterations also determines the number of bubbles displayed in each session, resulting in an intrinsic conflict: in an actual rehabilitation session, a larger number of iterations means a larger number of attempted movements or a longer training time. For example, if each bubble is present in the space for approximately 3 s, 1000 iterations, equivalent to 1000 bubbles, result in 50 min of training. This is not physically feasible and is probably even detrimental to the patient’s condition, engagement level and recovery process. Therefore, the number of iterations is a fundamental parameter to consider, as it significantly affects the efficiency of both the algorithm and the therapeutic process. An optimization of the number of bubbles/iterations is required in future work, considering the therapist’s specific recommendations regarding training time. The number of iterations may also be adaptive and vary from one session to another, according to the gradual change in the patient’s condition.
How are the reward function and algorithm parameters defined?
In reinforcement learning, the reward function is an incentive mechanism that tells the agent what is correct and what is wrong, using reward and punishment. To benefit from RL in healthcare it is crucial to choose a correct reward function46. This selection is somewhat complex, as it requires expertise and needs to balance short- and long-term reward. In the case studies in this simulation, a reward function is defined that gives a high reward in areas where the patient has difficulties, a smaller reward in areas he fails to reach, and a negative reward for areas where he has no difficulties. It is important to note that in future implementations an interface can be created to allow a physiotherapist or occupational therapist to define different rewards, across days and across types of disabilities. For example, for some types of injuries it may be harmful to practice in areas that cannot be reached; in this case a negative reward may be defined when a bubble is not popped. In addition, the system may be adapted to the protocols required for functional reach tests (FRT)47.
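The zone-based reward policy described above could be sketched as follows; the reward magnitudes and the difficulty threshold are hypothetical placeholders for values the therapist would configure:

```python
def zone_reward(kinematic_score, popped,
                hard_reward=10.0, unreached_reward=2.0, easy_penalty=-1.0,
                difficulty_threshold=0.7):
    """Zone-based reward: high where the patient pops bubbles with
    difficulty, smaller where he fails to reach, negative where he has
    no difficulty. All numeric values here are illustrative assumptions."""
    if not popped:
        return unreached_reward       # smaller reward: encourage attempts
    if kinematic_score < difficulty_threshold:
        return hard_reward            # training focus: difficult zone
    return easy_penalty               # discourage effortless zones

print(zone_reward(0.6, popped=True))   # 10.0 (difficult but achievable)
print(zone_reward(0.9, popped=True))   # -1.0 (no difficulty)
print(zone_reward(0.6, popped=False))  # 2.0  (failed to reach)
```

For an injury where practicing in unreachable areas is harmful, the therapist would simply set `unreached_reward` to a negative value, with no change to the learning loop.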
Other parameters may also be defined in the future by a therapist. For example, the learning rate may be set according to the expected rate of recovery. Another parameter is the exploration rate \(\varepsilon\), which determines how many choices will be random and how many greedy. An epsilon-greedy approach balances exploitation (greedy choices of the action with maximal expected reward) and exploration (random choices). A dynamically decreasing exploration rate starts with a higher rate of exploration and decreases according to the number of iterations and the accumulated reward. Different values in Eq. (5) and different strategies to balance exploitation and exploration can be applied according to prior knowledge of the patient’s abilities. For example, if a more dynamic recovery is expected, a higher exploration rate may be preferable, while if very stable abilities are expected, a strategy with more exploitation may be more efficient.
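Epsilon-greedy selection with a decaying exploration rate can be sketched as below. The initial value \(\varepsilon_0 = 0.5\) matches the simulation described earlier, but the decay schedule shown is an assumption (the paper's Eq. (5) may use a different form):

```python
import random

def choose_action(q_row, iteration, eps0=0.5, decay=0.001):
    """Epsilon-greedy choice over one Q-table row (one action-value per
    action). eps0 = 0.5 follows the simulation; the 1/(1 + decay*t)
    decay schedule is an illustrative assumption."""
    eps = eps0 / (1 + decay * iteration)          # decreasing exploration rate
    if random.random() < eps:
        return random.randrange(len(q_row))        # explore: random action
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit: greedy

# With eps0 = 0 the choice is purely greedy: action 1 has the highest value.
print(choose_action([1.0, 5.0, 2.0], iteration=0, eps0=0.0))  # 1
```

Raising `eps0` or slowing `decay` yields the more exploratory behavior suited to a dynamic recovery; lowering them favors exploitation when stable abilities are expected.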
In summary, the flexibility of the framework, reflected in its many configurable parameters, implies the need to pay special attention to its accessibility to clinicians embedding a personalized therapeutic policy. The implementation of the game in clinical trials should include an intuitive, user-friendly graphical user interface with a practical description of the game’s features, monitoring of the patient’s medical information, current range of motion, recommended number and duration of training sessions, and performance, as well as the customizable algorithm parameters mentioned above. The model-free reinforcement learning algorithm and its independence of specific medical conditions simplify the translation of its variables into practical multiple-choice GUI instructions, but the kinematic parameters (m1–m4) measuring performance should be discussed with professionals and tailored according to pathology and therapeutic strategy. To this end, a bank of documented kinematic parameters and a customizable scale of importance (determining the weights of parameters m1–m4 in Eq. (3)) may be added.
What are the limitations of the framework?
In implementing the proposed framework in real clinical trials, the validity and accuracy of the patient’s kinematic parameters, based on the measurement of hand position, should be considered. Several works discuss the accuracy and reliability of head-mounted displays and controllers in tracking translational and rotational movement, such as the HTC VIVE, Oculus Rift S, Oculus Touch and Meta Quest 248,49,50,51. These systems are found to be suitable for biomechanical and motor rehabilitation applications. However, the heterogeneity of the findings suggests that specific setups, including tracking space size and distance between measurement points, influence the positional error50. Carnevale et al.52 evaluated the accuracy of the Oculus Quest 2 (Meta Quest 2) VR system compared to a Qualisys optical capture system in measuring translational and rotational displacements for shoulder rehabilitation applications. In a translational range of 200 to 700 mm (corresponding to anthropometric forearm and upper limb evaluations), they reported a mean absolute error of 13.52 ± 6.57 mm at 500 mm from the HMD in the x-direction. The maximum mean absolute error for rotational displacements was 1.11 ± 0.37° for a rotation of 40° around the z-axis. In a different setup, Abdlkarim et al. reported an average fingertip positional error of 11 mm, an average finger joint angle error of 9.6° and an average temporal delay of 38 ms51. Extrapolating the results of52 to the proposed VR game with the current bubble spatial distribution (Fig. 3), a displacement error of 2.2% in the VR game space may correspond to a 2.75 mm error in bubble position precision, depending on the space size definition. Moreover, these studies do not consider variations in movement velocity; increasing movement speed will increase the positional error.
Therefore, a setup-specific analysis and evaluation of spatial and temporal accuracy will be imperative in determining bubble distribution resolution in the 3D virtual space before clinical trials.
Although immersive VR-based interventions are shown in literature to be beneficial in motor and neurorehabilitation, larger scale clinical studies are needed, including a systematic validation of VR tracking devices.
The resolution of the bubble distribution space influences the number of states in the 3D space, which implicitly changes the number and/or duration of the iterations/sessions needed. Because of the nature and number of possible actions (Up, Down, Right, Left, Forward, Backward) dictated by the space, the basic Q-learning algorithm proposed in this work manages to optimize bubble location with relatively few iterations, without the need for offline simulation models (as opposed to35). However, the present bubble distribution, related to the number of states in space, requires optimization and validation in real rehabilitation interventions, ensuring a balance between training time and algorithm performance, as mentioned earlier.
Conclusions
Virtual reality has great potential to extend rehabilitation therapy in the clinic and at home. For such alternative interventions to be effective, adaptive and personalized rehabilitation programs are much needed; they may introduce a new generation of engaging, flexible home training with immediate feedback, adaptive response and monitoring. This study proposes the implementation of reinforcement learning, and specifically Q-learning, in the customization of a rehabilitation serious game conceived for reaching movement treatment. The game presents in a virtual space bubbles to be reached and popped by the user, and adjusts in real time the number of bubbles in space according to the therapeutic strategy and the patient’s personal temporal and spatial abilities. The algorithm is validated by the simulation of two different test cases, presenting kinematic characteristics that change over time. The game presents more bubbles in the zones where a higher reward is defined and shows fast adaptive response to the changes in the patient’s ability throughout the days of practice. The simulation suggests that the algorithm offers good adaptive capabilities, and its relative simplicity enables further implementation of a user interface that will provide the therapist with the possibility to adapt the program’s parameters to each individual therapeutic strategy. Moreover, the therapist will be able to control the timing of bubble appearance and disappearance as well as the overall training time, thereby determining the total number of bubbles and attempted movements per training session. Future work will demonstrate the concept in clinical trials, where the kinematic characteristics of the patient will be measured and analyzed in real time to provide the algorithm with the data currently supplied by the simulation of the test cases.
Data availability
The datasets generated and/or analysed during the current study, including the simulation code and simulation matrices, are available in the GitHub repository at https://github.com/anatdhn/RLSimulaton/tree/main.
Abbreviations
- VR: Virtual reality
- ADL: Activities of daily living
- HMD: Head-mounted display
- AI: Artificial intelligence
- ROM: Range of motion
- FRT: Functional reach test
References
Watkins-Castillo, S. & Andersson, G. United States Bone and Joint Initiative: The Burden of Musculoskeletal Diseases in the United States (BMUS) (2014).
Faria, A. L., Andrade, A., Soares, L. & Badia, S. B. I. Benefits of virtual reality based cognitive rehabilitation through simulated activities of daily living: A randomized controlled trial with stroke patients. J. Neuroeng. Rehabil. 13, 1–12 (2016).
Sveistrup, H. Motor rehabilitation using virtual reality. J. NeuroEng. Rehabil. 1, 10. https://doi.org/10.1186/1743-0003-1-10 (2004).
Pishkhani, M. K., Dalvandi, A., Ebadi, A. & Hosseini, M. A. Adherence to a rehabilitation regimen in stroke patients: A concept analysis. Iran. J. Nurs. Midwifery Res. 25, 139 (2020).
Rose, T., Nam, C. S. & Chen, K. B. Immersion of virtual reality for rehabilitation—Review. Appl. Ergon. 69, 153–161. https://doi.org/10.1016/j.apergo.2018.01.009 (2018).
Rizzo, A. & Kim, G. J. A SWOT analysis of the field of virtual reality rehabilitation and therapy. Presence Teleoper. Virtual Environ. 14, 119–146. https://doi.org/10.1162/1054746053967094 (2005).
Burdea, G. C. Virtual rehabilitation—Benefits and challenges. Methods Inf. Med. 42, 519–523. https://doi.org/10.1055/s-0038-1634378 (2003).
Aida, J., Chau, B. & Dunn, J. Immersive virtual reality in traumatic brain injury rehabilitation: A literature review. NeuroRehabilitation 42, 441–448. https://doi.org/10.3233/NRE-172361 (2018).
Maier, M., Rubio Ballester, B., Duff, A., Duarte Oller, E. & Verschure, P. F. M. J. Effect of specific over nonspecific VR-based rehabilitation on poststroke motor recovery: A systematic meta-analysis. Neurorehabil. Neural Repair 33, 112–129. https://doi.org/10.1177/1545968318820169 (2019).
Jack, D. et al. Virtual reality-enhanced stroke rehabilitation. IEEE Trans. Neural Syst. Rehabil. Eng. 9, 308–318 (2001).
Moreira, M. C., De Amorim Lima, A. M., Ferraz, K. M. & Benedetti Rodrigues, M. A. Use of virtual reality in gait recovery among post stroke patients-a systematic literature review. Disabil. Rehabil. Assist. Technol. 8, 357–362 (2013).
Reid, D. T. Benefits of a virtual play rehabilitation environment for children with cerebral palsy on perceptions of self-efficacy: A pilot study. Pediatr. Rehabil. 5, 141–148 (2002).
Deutsch, J. E. et al. Rehabilitation of musculoskeletal injuries using the Rutgers Ankle haptic Interface: three case reports. In Proceedings of EuroHaptics 2001 Conference (2001).
Fulk, G. D. Locomotor training and virtual reality-based balance training for an individual with multiple sclerosis: A case report. J. Neurol. Phys. Ther. 29, 34–42 (2005).
Mirelman, A. et al. Virtual reality for gait training: Can it induce motor learning to enhance complex walking and reduce fall risk in patients with Parkinson’s disease?. J. Gerontol. Ser. A Biol. Sci. Med. Sci. 66A, 234–240 (2011).
Elshazly, F. A. A. et al. Comparative study on virtual reality training (VRT) over sensory motor training (SMT) in unilateral chronic osteoarthritis—A randomized control trial. Int. J. Med. Res. Health Sci. 5, 7–16 (2016).
Ma, M. & Bechkoum, K. Serious games for movement therapy after stroke. In Conference Proceedings—IEEE International Conference on Systems, Man and Cybernetics. https://doi.org/10.1109/ICSMC.2008.4811562 (2008).
Cameirão, M. S., Badia, S. B. I., Oller, E. D. & Verschure, P. F. M. J. Neurorehabilitation using the virtual reality based rehabilitation gaming system: Methodology, design, psychometrics, usability and validation. J. Neuroeng. Rehabil. 7, 1–14 (2010).
Nirme, J., Duff, A. & Verschure, P. F. M. J. Adaptive rehabilitation gaming system: On-line individualization of stroke rehabilitation. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS. https://doi.org/10.1109/IEMBS.2011.6091665 (2011).
Saurav, K., Dash, A., Solanki, D. & Lahiri, U. Design of a VR-based upper limb gross motor and fine motor task platform for post-stroke survivors. In Proceedings—17th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2018. https://doi.org/10.1109/ICIS.2018.8466538 (2018).
Lafond, I., Qiu, Q. & Adamovich, S. V. Design of a customized virtual reality simulation for retraining upper extremities after stroke. In Proceedings of the 2010 IEEE 36th Annual Northeast Bioengineering Conference, NEBEC 2010. https://doi.org/10.1109/NEBC.2010.5458130 (2010).
Wu, W., Wang, D., Wang, T. & Liu, M. A personalized limb rehabilitation training system for stroke patients. In 2016 IEEE International Conference on Robotics and Biomimetics, ROBIO 2016. https://doi.org/10.1109/ROBIO.2016.7866610 (2016).
Chen, Y., Garcia-Vergara, S. & Howard, A. M. Effect of a home-based virtual reality intervention for children with cerebral palsy using super pop VR evaluation metrics: A feasibility study. Rehabil. Res. Pract. 2015, 1–9 (2015).
Dimbwadyo-Terrer, I. et al. Effectiveness of the virtual reality system Toyra on upper limb function in people with tetraplegia: A pilot randomized clinical trial. Biomed. Res. Int. 2016, 1–12 (2016).
Cameirão, M. S., Badia, S. B. I., Duarte, E., Frisoli, A. & Verschure, P. F. M. J. The combined impact of virtual reality neurorehabilitation and its interfaces on upper extremity functional recovery in patients with chronic stroke. Stroke 43, 2720–2728 (2012).
Cyrino, G., Tannus, J., Lamounier, E., Cardoso, A. & Soares, A. Serious game with virtual reality for upper limb rehabilitation after stroke. In Proceedings—2018 20th Symposium on Virtual and Augmented Reality, SVR 2018. https://doi.org/10.1109/SVR.2018.00006 (2018)
Kilbride, C. et al. Rehabilitation via home based gaming exercise for the upper-limb post stroke (rhombus): Protocol of an intervention feasibility trial. BMJ Open 8, e026620 (2018).
Osgouei, R. H., Soulsby, D. & Bello, F. Rehabilitation exergames: Use of motion sensing and machine learning to quantify exercise performance in healthy volunteers. JMIR Rehabil. Assist. Technol. 7, e17289 (2020).
Zini, F., Le Piane, F. & Gaspari, M. Adaptive cognitive training with reinforcement learning. ACM Trans. Interact. Intell. Syst. 12, 1–29 (2022).
Stasolla, F. & Di Gioia, M. Combining reinforcement learning and virtual reality in mild neurocognitive impairment: a new usability assessment on patients and caregivers. Front. Aging Neurosci. 15, 1189498. https://doi.org/10.3389/fnagi.2023.1189498 (2023).
Andriella, A., Torras, C. & Alenyà, G. Cognitive system framework for brain-training exercise based on human-robot interaction. Cognit. Comput. 12, 793–810 (2020).
Dobrovsky, A., Borghoff, U. M. & Hofmann, M. Improving adaptive gameplay in serious games through interactive deep reinforcement learning. https://doi.org/10.1007/978-3-319-95996-2_19 (2019).
Maskeliunas, R. et al. Deep reinforcement learning-based iTrain serious game for caregivers dealing with post-stroke patients. Information (Switzerland) 13, 564 (2022).
Barzilay, O. & Wolf, A. Adaptive rehabilitation games. J. Electromyogr. Kinesiol. 23, 182–189 (2013).
Tsiakas, K., Huber, M. & Makedon, F. A multimodal adaptive session manager for physical rehabilitation exercising. In 8th ACM International Conference on Pervasive Technologies Related to Assistive Environments, PETRA 2015—Proceedings. https://doi.org/10.1145/2769493.2769507 (2015).
Sekhavat, Y. A. MPRL: Multiple-periodic reinforcement learning for difficulty adjustment in rehabilitation games. In 2017 IEEE 5th International Conference on Serious Games and Applications for Health, SeGAH 2017. https://doi.org/10.1109/SeGAH.2017.7939260 (2017).
Zahabi, M. & Abdul Razak, A. M. Adaptive virtual reality-based training: A systematic literature review and framework. Virtual Real. 24, 725–752 (2020).
Sutton, R. S. & Barto, A. G. Reinforcement Learning, Second Edition: An Introduction—Complete Draft (The MIT Press, 2018).
Jang, B., Kim, M., Harerimana, G. & Kim, J. W. Q-learning algorithms: A comprehensive classification and applications. IEEE Access 7, 133653–133667 (2019).
Bulut, V. Optimal path planning method based on epsilon-greedy Q-learning algorithm. J. Braz. Soc. Mech. Sci. Eng. 44, 106 (2022).
Wang, H., Emmerich, M. & Plaat, A. Assessing the Potential of Classical Q-learning in General Game Playing. In Artificial Intelligence. BNAIC 2018. Communications in Computer and Information Science Vol. 1021 (eds Atzmueller, M. & Duivesteijn, W.) (Springer, Cham., 2019) https://doi.org/10.1007/978-3-030-31978-6_11.
Even-Dar, E. & Mansour, Y. Convergence of optimistic and incremental Q-learning. In Advances in Neural Information Processing Systems 2001 (eds Dietterich, T. G. et al.) 1499–1506 (MIT Press, Cambridge, 2001).
Drillis, R., Contini, R. & Bluestein, M. Body segment parameters: A survey of measurement techniques. Artif. Limbs 25, 44–66 (1964).
Flash, T. & Hogan, N. The coordination of arm movements: An experimentally confirmed mathematical model. J. Neurosci. 5, 1688–1703 (1985).
Hogan, N. An organizing principle for a class of voluntary movements. J. Neurosci. 4, 2745–2754 (1984).
Coronato, A., Naeem, M., De Pietro, G. & Paragliola, G. Reinforcement learning for intelligent healthcare applications: A survey. Artif. Intell. Med. 109, 101964 (2020).
Duncan, P. W. et al. Functional reach: A new clinical measure of balance. J. Gerontol. 45, M192–M197 (1990).
Shum, L. C., Valdés, B. A. & Van der Loos, H. M. Determining the accuracy of oculus touch controllers for motor rehabilitation applications using quantifiable upper limb kinematics: Validation study. JMIR Biomed. Eng. 4, e12291 (2019).
Jost, T. A., Nelson, B. & Rylander, J. Quantitative analysis of the Oculus Rift S in controlled movement. Disabil. Rehabil. Assist. Technol. 16, 632–636 (2021).
Holzwarth, V., Gisler, J., Hirt, C. & Kunz, A. Comparing the accuracy and precision of steamvr tracking 2.0 and oculus quest 2 in a room scale setup. In ACM International Conference Proceeding Series. https://doi.org/10.1145/3463914.3463921 (2021).
Abdlkarim, D. et al. A methodological framework to assess the accuracy of virtual reality hand-tracking systems: A case study with the oculus quest 2. bioRxiv (2022).
Carnevale, A. et al. Virtual reality for shoulder rehabilitation: Accuracy evaluation of oculus quest 2. Sensors 22, 5511 (2022).
Acknowledgements
The authors would like to acknowledge undergraduate students Eden Schwartz, Bar Korkos and Noam Drori for their work on developing the VR reaching algorithm and application.
Funding
This work was supported by the Flagship Project of Braude College of Engineering: “From the Heart” Center for Research and Applied Knowledge for Populations with Special Needs.
Author information
Authors and Affiliations
Contributions
O.B.B. and D.I. conceptualized the study, A.D. defined the methodology, N.R., T.Y., A.D.P. and A.D. contributed to the simulation and analysis, N.R., A.D.P. and A.D. wrote and edited the manuscript. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pelosi, A.D., Roth, N., Yehoshua, T. et al. Personalized rehabilitation approach for reaching movement using reinforcement learning. Sci Rep 14, 17675 (2024). https://doi.org/10.1038/s41598-024-64514-6