A humanoid control strategy based on deep reinforcement learning for enhanced comfort in lower limb rehabilitation robots

Jin, Yingrui; Zhang, Jiahao; Li, Wei; Yu, Jun; Wang, Zhaoyang; Sun, Shuyi

doi:10.1038/s41598-026-39011-7

Download PDF

Article
Open access
Published: 05 February 2026

A humanoid control strategy based on deep reinforcement learning for enhanced comfort in lower limb rehabilitation robots

Yingrui Jin¹^na1,
Jiahao Zhang³^na1,
Wei Li²,
Jun Yu¹,
Zhaoyang Wang¹ &
…
Shuyi Sun¹

Scientific Reports volume 16, Article number: 7370 (2026) Cite this article

1948 Accesses
14 Altmetric
Metrics details

Subjects

Abstract

The development of humanoid intelligent controllers represents a breakthrough in enhancing the effectiveness and comfort of rehabilitation training using lower limb rehabilitation robots for diverse gait patterns. In this study, a kinematic model of the lower limb rehabilitation robot is established based on a simplified link structure. For dynamic modeling, the Lagrange formulation is employed to analyze human lower limb motion from an energy-based perspective.Gait data were collected from five healthy subjects, with each performing 20 walking trials, yielding a total of 100 gait cycles for analysis. The acquired gait data are filtered and fitted to plan a reference trajectory of anthropomorphic joint angles suitable for rehabilitation training. Joint torques are computed from plantar force data and the dynamic model, serving as feedback for the controller and used in comfort assessment. A humanoid control strategy integrating Deep Reinforcement Learning (DRL) and a Proportional-Differential (PD) controller is proposed. This approach facilitates individualized gait trajectory planning for rehabilitation training by learning human gait characteristics. Simulation results demonstrate that the proposed method not only improves the intelligence and human-likeness of the robot but also significantly enhances comfort during training.

Bioinspired morphology and task curricula for learning locomotion in bipedal muscle-actuated systems

Article Open access 20 June 2025

Lower-extremity inter-joint coordination variability in active individuals with transtibial amputation and healthy males during gait

Article Open access 22 May 2024

Experiment-free exoskeleton assistance via learning in simulation

Article 12 June 2024

Introduction

Recently, the number of patients suffering from lower limb functional impairments, caused by conditions like apoplexy and spinal cord injury, has been rapidly on the rise, particularly among the elderly population. Scholars have confirmed that robot-assisted rehabilitation has positive effects on the recovery of patients’ motor ability^1,2,3,4. The interactive control, which is based on force information, is achieved by incorporating the recovery patient as a component of the control system⁵. The methods of combining control technology with artificial intelligence have emerged as research challenges for rehabilitation robots, attracting considerable attention from numerous scholars^6,7,8.

Rehabilitation robot control is a nonlinear control. Several researchers have suggested different control strategies, including PD control⁹, sliding mode control¹⁰, fuzzy control¹¹, etc. The dynamics of rehabilitation robots exhibit highly nonlinear characteristics, which pose challenges in establishing accurate mathematical models. Hu et al. used a three-dimensional motion capture system to collect human gait data and designed a robust adaptive class PD controller¹². The PD control, as a classical control strategy for robot control¹³-¹⁴, is widely used due to its simplicity. It is often combined with other controllers to achieve the desired control effect. Reinforcement learning is a machine learning algorithm¹⁵. Firstly, an agent evaluates the value function of the current policy it adopts. Secondly, based on the evaluation of the value function, the agent updates its policy. The agent acquires the optimal policy through continuous interaction with the environment¹⁶. In the context of rehabilitation robot research, combining DRL g with traditional control can enhance the intelligence level of the control system. Therefore, we adopted this control strategy in this paper.

Recently, DRL has emerged as a powerful framework for solving complex control problems in robotics. Notably, studies such as Robust walking control.musculoskeletal model²⁴ have coupled DRL with biomechanical models for more natural gait generation. Furthermore, advancements in wearable devices Numerical and experimental study. wearable exo-glove²⁶ and terrain adaptation Near-optimal terrain collision avoidance²⁵ highlight the trend towards intelligent and adaptive rehabilitation systems. Our work builds upon these developments by integrating DRL with traditional PD control to capture human gait characteristics directly from motion capture data, aiming to enhance comfort and adaptability.

In the dynamic modeling of the lower limb rehabilitation robots, the Lagrange formulation and the Newton-Euler formulation were used commonly¹⁷. The Newton–Euler formulation not only requires solving the mechanical equations of the links but also involves calculating the interaction forces between the links, which increases the computational difficulty and modeling complexity. The Lagrange formulation analyzes robot dynamics from the perspective of energy variations, establishing appropriate generalized coordinate systems for the robot and considering the kinetic and potential energies to determine its motion and interactions¹⁸. The Lagrange formulation avoids the analysis of internal forces within the entire robot and is suitable for analyzing human lower limb motion from the perspective of energy metabolism. This approach has been adopted in this paper.

This study utilizes a control strategy combining DRL with PD control. Leveraging a substantial dataset of 100 gait cycles from five subjects, we demonstrate that the strategy exhibits not only good effectiveness but also robust performance across diverse gait characteristics.This paper is organized as follows: In “The modeling of lower limb rehabilitation robot”, the modeling of the lower limb rehabilitation robot is introduced. The Lagrange formulation has been adopted to analyze the human lower limb motion. In “Design of model-based deep reinforcement learning PD controller”, the PD Controller based on DRL is designed. The experiment and simulation results are given in “Experiment and simulation results”. Finally, the conclusions are drawn in “Conclusions and outlook”.

The modeling of lower limb rehabilitation robot

The kinematic modeling of lower limb rehabilitation robot

For DRL, we need to build a suitable robot model. In this paper, a three-link two-dimensional robot model (as shown in Fig. 1) was built, and its motion equation was solved. The motion equation is established by using the energy constraints of the model, which is suitable for describing the robot’s walking.

In this model, the joint angles are defined as follows: $\:{\text{q}}_{1}$ represents the knee joint angle,$\:\:{\text{q}}_{2}$ represents the hip joint angle, and $\:{\text{q}}_{3}$ represents the torso inclination angle.

Among the equations, $\:l$ represents the length of each limb of the robot, $\:m$ is the weight of each limb of the robot, $\:{q}_{1},{q}_{2},{q}_{3}$ are the joint angles respectively. Simplify multiple joints of the lower limbs into a two-dimensional model of a three-link, which greatly reduces computational cost and difficulty.

To obtain gait data, the reflex marker was fixed to the outside of the subject’s joints, and the subject was required to walk freely in the built-up 3D motion capture space. Then record the coordinate information of the reflection marks of each joint and the data of the ground reaction force. Then you can calculate the angle change of each joint, that is as the Eqs. (1)–(3):

$$\:{q}_{1}={\text{tan}}^{-1}\left(\frac{ank\_R\_x-hip\_x}{hip\_z-ank\_R\_z}\right)$$

(1)

$$\:{q}_{2}={\text{tan}}^{-1}\left(\frac{hip\_x-ank\_L\_x}{hip\_z-ank\_L\_z}\right)$$

(2)

$$\:{q}_{3}={\text{tan}}^{-1}\left(\frac{head\_x-hip\_x}{head\_z-hip\_z}\right)(ank\_R\_x,ank\_R\_z)$$

(3)

Among them, $\:\left(ank\_L\_x,ank\_L\_z\right)$, $\:\left(ank\_R\_x,ank\_R\_z\right)$, $\:\left(hip\_x,hip\_z\right)$, and $\:\left(head\_x,head\_z\right)$ are the coordinate of ankle joint, hip joint, and torso respectively.

This study is based on gait data obtained from five healthy subjects. The key kinematic and spatiotemporal parameters were analyzed for the entire dataset. To exemplify the specific parameters extracted and their typical ranges, the data from one subject (67.5 kg, 1.68 m) are detailed in Table 1.

Table 1 Some walking characteristics of the subjects.

Full size table

According to the above formula, the angle of $\:{q}_{1}, {q}_{2}$ and $\:{q}_{3}$ changed. The angles of $\:{q}_{1}$ and $\:{q}_{2}$ changed into obvious periodicity and the symmetry of alternating left and right legs, and all change within a certain range. However, there is no obvious periodicity in the angle change of $\:{q}_{3}$ but it is only a smooth change in a small range.

For the spatial position of the joints, we are interested in the height changes of the hip and ankle joints (as shown in Fig. 2). The height of the hip joint and ankle joint both show periodic reciprocating movement within a certain range.

Figure 2 depicts the vertical displacement of the hip and ankle joints (derived from motion capture data of healthy participants) over a single gait cycle.

The three-link model is used for gait data fitting and simulation validation to better approximate the human-robot system. The theoretical derivation based on a two-link model simplifies the problem while capturing the core dynamics for controller design and stability analysis.

Dynamic modeling based on Lagrangian

In the dynamic modeling of the lower limb rehabilitation robots, the Lagrange formulation has been adopted in this paper to analyze the human lower limb motion from the perspective of energy metabolism.

The form of the Lagrange function is as Eq. (4),

$$\:L\:=\:E_k\:-\:E_p$$

(4)

where $\:{E}_{k}$ is the total kinetic energy and, $\:{E}_{p}$ is the total potential energy.

The Lagrange equation is,

$$\:\tau\:\:=\frac{d}{dt}\frac{\partial\:l}{\partial\:\dot{q}}-\frac{\partial\:l}{\partial\:q}$$

(5)

Connecting rod speed $\:{v}_{d1},{v}_{d2}$, The kinetic energy of the thigh and calf,

$$\:{E}_{k1}=\frac{1}{2}{m}_{1}{v}_{d1}^{2},{E}_{k2}=\frac{1}{2}{m}_{2}{v}_{d2}^{2}$$

(6)

Then, the total kinetic energy of the two-link model is,

$$\:{E}_{k}={E}_{k1}+{E}_{k2}=\frac{1}{2}{m}_{2}({l}_{1}{\dot{q}}_{hip}{)}^{2}+\frac{1}{2}{m}_{2}{l}_{1}{l}_{2}{\dot{q}}_{hip}({\dot{q}}_{hip}-{\dot{q}}_{knee}){cos}{q}_{knee}+\frac{1}{8}{m}_{1}({l}_{1}{\dot{q}}_{hip}{)}^{2}+\frac{1}{8}{m}_{2}\left({l}_{2}\right({\dot{q}}_{hip}-{\dot{q}}_{knee}){)}^{2}$$

(7)

The potential energies of the thigh and calf are, respectively,

$$\:{E}_{p1}={m}_{1}g{y}_{d1},{E}_{p2}={m}_{2}g{y}_{d2}$$

(8)

Then, the total potential energy of the two-link model is,

$$\:{E}_{p}={E}_{p1}+{E}_{p2}={m}_{2}g{l}_{1}{cos}{q}_{hip}+\frac{1}{2}{m}_{1}g{l}_{1}{cos}{q}_{hip}+\frac{1}{2}{m}_{2}g{l}_{2}{cos}({q}_{hip}-{q}_{knee})$$

(9)

Then establish the Lagrange equation,

$$\:{\tau\:}_{hip}=\frac{d}{dt}\frac{\partial\:l}{\partial\:{\dot{q}}_{hip}}-\frac{\partial\:l}{\partial\:{q}_{hip}}$$

(10)

$$\:{\tau\:}_{knee}=\frac{d}{dt}\frac{\partial\:l}{\partial\:{\dot{q}}_{knee}}-\frac{\partial\:l}{\partial\:{q}_{knee}}$$

(11)

Here, Eqs. (10) and (11) are the specific applications of the general Lagrange Eq. (5) to the hip and knee joints, respectively. They describe how the joint torques $\:{\tau\:}_{hip}\:$and $\:{\tau\:}_{knee}$ are derived from the partial derivatives of the Lagrangian L(kinetic energy minus potential energy) with respect to the joint angles qand their velocities dq.

By finishing Eqs. (7), (9), (10) and (11) the dynamic equation of the two-link model can be obtained,

$$\:\varvec{M}\left(\varvec{q}\right)\ddot{q}+\varvec{C}(\varvec{q},\dot{\varvec{q}})\dot{q}+\varvec{G}\left(\varvec{q}\right)=\varvec{\tau\:}$$

(12)

where $\:\varvec{M}\left(\varvec{q}\right)$ is the inertia matrix, $\:\varvec{C}(\varvec{q},\dot{\varvec{q}})$ is the centripetal force matrix, $\:\varvec{G}\left(\varvec{q}\right)$ is the gravity matrix, $\:\varvec{\tau\:}$ is the driving torque of the robot.

To balance the complexity of theoretical derivation and the fidelity of simulation verification, this study adopts models with varying levels of complexity. In the dynamic modeling based on the Lagrangian method, a two-link model is employed for derivation. This model features a simple structure, which can clearly demonstrate the energy relationship of the system and facilitate the stability analysis of the controller (Eq. 12).

In the simulation verification presented in “Experiment and simulation results”, to better approximate the real human gait, the three-link model described in “The kinematic modeling of lower limb rehabilitation robot” (Fig. 1) was adopted. There is a clear correspondence between the two models in terms of joint variables: the hip and knee joints in the two-link model correspond to the variable combinations o$\:{q}_{2}$ and $\:{q}_{1}$ in the three-link model, respectively. This correspondence ensures the coherence and consistency from dynamic theory to simulation experiments.

Design of model-based deep reinforcement learning PD controller

In this work, the Deep Deterministic Policy Gradient (DDPG) algorithm is adopted due to its suitability for continuous control tasks.As part of machine learning, Reinforcement learning is an optimal strategy for learning $\:\pi\:$ Maximize cumulative rewards. i.e. for time $\:t$, the agent receiving the observed value $\:{O}_{t}$ of the state $\:{S}_{t}$, simultaneous strategy $\:\pi\:\left({a}_{t}|{s}_{t}\right)$ gives an action $\:{a}_{t}$ in response to an observation and earns rewards $\:{r}_{t}=\left({s}_{t},{a}_{t},{s}_{t+1}\right)$ and a new state $\:{s}_{t+1}$, then repeats this process in order to get the largest cumulative reward. Figure 3 shows a general model for reinforcement learning.

Deep reinforcement learning algorithms

DRL aims to build a comprehensive framework for the interplay between learning, representation, and decision making¹⁹. At the same time, it also circumvents the limitations of traditional reinforcement learning. DRL enables agents to use neural networks to represent policies $\:{\pi\:}_{\theta\:}\left(s\right)$ and to make decisions from high-dimensional and unstructured input data. Among them, deep Q-learning directly uses the original high-dimensional data for reinforcement learning, and at the same time uses stochastic gradient descent for training, which ensures the stability of the training process and has good convergence characteristics. Moreover, it has stronger sample efficiency in high-dimensional state-action space, and can learn random strategies, which improves the robustness of the algorithm. In the Deep Q-Network (DQN) algorithm, the optimization metric function is defined by $\:J\left(\theta\:\right)$, make the parameter is adjusted towards j the direction of maximization, and finally the optimal strategy is given. The formula for Markov decision process (MDP) is shown in Eqs. (13) and (14). Although it is easier to converge to a local optimum, it is accompanied by a larger variance.

$$J(\theta )={E_{{\pi _\theta }}}[r]=\sum\limits_{{s \in S}} {d(s)} \sum\limits_{{a \in A}} {{\pi _\theta }(s,a)} {R_{s,a}}$$

(13)

$${\Delta _\theta }J(\theta )=\sum\limits_{{s \in S}} {d(s)} \sum\limits_{{a \in A}} {{\pi _\theta }(s,a){\Delta _\theta }\log {\pi _\theta }(s,a){R_{s,a}}}$$

(14)

Among the Eq. (13), $\:d\left(s\right)$ is the probability distribution of random state $\:a$, $\:{R}_{s,a}$ is the reward at state$\:\:a$ after acting $\:s$. The gradient of $\:J$ can be obtained using the strategy gradient method of Eq. (14), and the value function approximation technique can directly approximate $\:{Q}^{{\pi\:}_{\theta\:}}\left(s,a\right)$, as shown in Eq. (15).

$${\Delta _\theta }{\pi _\theta }(s,a)={\pi _\theta }(s,a)\frac{{{\Delta _\theta }{\pi _\theta }(s,a)}}{{{\pi _\theta }(s,a)}}={\Delta _\theta }\log {\pi _\theta }(s,a)$$

(15)

$${\Delta _\theta }J(w)={E_{{\pi _w}}}[{\Delta _w}\log {\pi _w}(s,a){Q^{{\pi _w}}}(s,a)]$$

(16)

A robot is a continuous, nonlinear, and strongly coupled system. The deep deterministic policy gradient (DDPG) algorithm²⁰ uses an actor-critic architecture that can handle off-policy data. The DDPG algorithm includes Actor network and Critic network. Among them, the Critic network is used to approximate, and the Actor network is used to approximate $\:{\pi\:}_{\theta\:}$. Specifically, Actor maps the state of the system to actions, and Critic updates this mapping to generate an increase $\:Q$value action. The network uses the approximation capabilities of neural networks to update the parameters of the action-value function $\:w$, Parameters of the actor network update strategy $\:\theta\:$. However, the Actor-Critic method produces approximation errors when using neural network approximation, and its approximate policy gradient is as Eqs. (17)–(19),

$$\:{Q}_{w}(s,a)\approx\:{Q}^{{\pi\:}_{\theta\:}}(s,a)$$

(17)

$$\:{\varDelta\:}_{\theta\:}J\left(\theta\:\right)\approx\:{E}_{{\pi\:}_{\theta\:}}\left[{\varDelta\:}_{\theta\:}{log}{\pi\:}_{\theta\:}\right(s,a\left){Q}_{w}\right(s,a\left)\right]$$

(18)

$$\:\varDelta\:\theta\:=\alpha\:{\varDelta\:}_{\theta\:}{log}{\pi\:}_{\theta\:}(s,a){Q}_{w}(s,a)$$

(19)

The DDPG does not use recent data to update the policy but records the data in a buffer for later use when updating the policy. This algorithm does not need any environment transition model to find the optimal policy, it only needs (state, action, next state, reward) tuples. So better tuning strategies can be achieved.

PD controller design based on deep reinforcement learning

The control strategy proposed in this study adopts a collaborative architecture integrating a deep reinforcement learning (DRL) agent and a proportional-derivative (PD) controller. As illustrated in Fig. 4, within the overall system framework, the DRL agent acts as the upper-layer decision-maker, which is responsible for learning human gait characteristics and generating adaptive control actions; the PD controller functions as the lower-layer tracker, ensuring the joint angles accurately track the reference trajectories.

Based on the established model and the derived Eq. (12), consider the total friction of the robot, the kinetic equation of the model is:

$$\:\varvec{M}\left(\varvec{q}\right)\ddot{q}+\varvec{C}(\varvec{q},\dot{\varvec{q}})\dot{q}+\varvec{G}\left(\varvec{q}\right)+{F}_{v}=\tau\:=\varvec{B}\varvec{u}$$

(20)

Among them: $\:\varvec{M}\left(\varvec{q}\right)$ is the positive definite inertia matrix, $\:\varvec{C}(\varvec{q},\dot{\varvec{q}})$ is the matrix of the Coriolis force and centrifugal force, $\:\varvec{G}\left(\varvec{q}\right)={\left[{G}_{11},{G}_{12},{G}_{13}\right]}^{T}$ is the matrix of the Gravity, $\:{F}_{v}$ is the total friction of the robot, $\:\tau\:$ is the torque, $\:\varvec{u}$ is the driving torque, $\:\varvec{B}$ is the transform matrix.

$$\:\tau\:={\varvec{k}}_{\varvec{p}}e+{\varvec{k}}_{\varvec{d}}\dot{e}$$

(21)

Among them: $\:{\varvec{k}}_{\varvec{p}},{\varvec{k}}_{\varvec{d}}$ are both the positive symmetry matrix, $\:\varvec{e}={\varvec{q}}_{\varvec{d}}-\varvec{q}$, $\:{\varvec{q}}_{\varvec{d}}$ is the desired joint angle. Combined Eqs. (20) and (21), Eq. (22) can be obtained:

$$\:\varvec{M}\left(\varvec{q}\right)({\ddot{q}}_{d}-\ddot{q})+\varvec{C}(\varvec{q},\dot{\varvec{q}})({\dot{q}}_{d}-\dot{q})+{K}_{d}\dot{e}+{K}_{p}e=0$$

(22)

It can also be expressed as: $\:\varvec{M}\left(\varvec{q}\right)\:\ddot{e}+\varvec{C}\left(\varvec{q},\dot{\varvec{q}}\right)\:\dot{e}+{K}_{p}e=-{K}_{d}\dot{e}$.

Consider the candidate Lyapunov function:

$$\:V=\frac{1}{2}{\dot{e}}^{T}M\left(q\right)\dot{e}+\frac{1}{2}{\dot{e}}^{T}{K}_{d}\dot{e}$$

(23)

This equation is positive definite. Its derivative of time is as the following Eq. (24):

$$\:\dot{V}={\dot{e}}^{T}M\ddot{e}+\frac{1}{2}{\dot{e}}^{T}M\left(q\right)\dot{e}+{\dot{e}}^{T}{K}_{d}e$$

(24)

There is an oblique symmetrical feature matrix which satisfy: $\:\dot{M}\left(q\right)-2C(q,\dot{q})=0$, substitution into the Eq. (24), and obtained:

$$\:\dot{V}={\dot{e}}^{T}(M\ddot{e}+D\dot{e}+{K}_{d}e)=-{\dot{e}}^{T}{K}_{d}\dot{e}\le\:0$$

(25)

Thus, it is ensured that the control system to be designed meets the stability in the sense of Lyapunov.

This system controls the hip and knee joints of the robot, with a total of 2 Degrees of Freedom (DOF) and equipped with two actuators, thus being a fully actuated system. This provides a foundation for precise trajectory tracking and attitude control. The gait control method of the robot proposed in this section is to apply the driving torques of the two hip joints to the robot in accordance with a specific gait, thereby driving the robot to run at the desired gait. It uses two constraints, which are functions of the model coordinates (the angular trajectory of the hips, posture, and swinging feet, and the required tilt angle). To improve the stability of the robot, PD control has been introduced. The PD controller is given by the following Eq. (26):

$$\:u={\varvec{k}}_{\varvec{p}}y+{\varvec{k}}_{\varvec{d}}\dot{y}$$

(26)

Among them, the control variable $\:y=[{y}_{1},{y}_{2}]$ is defined as $\:{y}_{1}={q}_{hip}-{q}_{hip,d}$ and $\:{y}_{2}={q}_{knee}-{q}_{knee,d}$, i.e., the tracking errors of the joint angles.

The DRL controller is designed with the following key components:

State Space (s): The state observed by the agent includes the joint angles and velocities, defined as s = [$\:{q}_{hip}$, $\:{q}_{knee}$, $\:{dq}_{hip}$, $\:{dq}_{knee}$].

Action Space (a): The agent outputs the desired joint torques, a = [$\:{u}_{hip}$, $\:{u}_{knee}$], which are applied to the robot model.

Reward Function (r): As defined in Eq. (27), the reward ris a composite function designed to encourage desired walking behavior while ensuring safety and comfort. Key terms include $\:{r}_{speed}$ for velocity tracking, and penalty terms like $\:{p}_{effort}$(Eq. 28) for minimizing joint torques to enhance comfort.

Network Architecture: The DDPG algorithm employs an Actor-Critic architecture. Both Actor and Critic networks are fully connected neural networks with two hidden layers (256 and 128 neurons, respectively) and ReLU activation functions.

Training Details: The training was conducted using the MATLAB DRL Toolbox, as described in “PD controller design based on deep reinforcement learning”. Key hyperparameters include: a learning rate of 1e−4 for both networks, a discount factor of 0.99, a replay buffer size of 1e6, and a mini-batch size of 128. The agent was trained for up to 10,000 episodes.”

The agent is composed of actor network and critic network. In Fig. 5, we observe this system in more detail. The actor gets the current status $\:q$, $\:{d}_{q}$ and gives actions $\:{u}_{1}$,$\:\:{u}_{2}$. The critic obtains the status and the actions, then estimates the value of the reward function $\:{Q}_{0}$ in advance. The environment represents the real physical world where the robot is trying to walk. The robot obtains the action and calculates the next state, as well as the reward function in the new state. The agent is optimized by minimize the loss function. Then, it propagates a gradient back to the critic and the actor to optimize the two networks.

Rewards are defined through an iterative process. The reward function is given by the following Eq. (27), where $\:r$ represents the reward value and $\:p$ represents the penalty value.

$$\:Q={r}_{speed}-{p}_{trunk}+{r}_{survival}-{p}_{zhip}-{p}_{za}-{p}_{pd}-{p}_{effort}$$

(27)

The design of the reward function fully incorporates the safety and movement quality of rehabilitation training. Specifically,$\:{r}_{speed}$ encourages a natural walking pace, while multiple penalty terms directly safeguard safety:$\:{\:p}_{trunk}$ is used to maintain trunk uprightness and prevent falls; $\:{p}_{zhip}$ stabilizes the height of the hip joints; $\:{p}_{za}$ ensures sufficient ground clearance for the swing foot to avoid tripping; $\:{p}_{effort}$ penalizes excessive torques, aiming to generate smooth and comfortable movements and reduce patient discomfort. The weight settings are based on the analysis of healthy gait characteristics and parameter tuning.

a key comfort-oriented penalty $\:{p}_{effort}$ defined as the normalized squared magnitude of the joint torques:

$$\:{p}_{effort}=\eta\:\sum\:_{i=1}^{n}{\tau\:}_{i}^{2}$$

(28)

where $\:{\uptau\:}$ is the torque at joint i, n is the number of actuated joints, and η is a weighting coefficient. Minimizing this term directly minimizes the interaction forces, which is our primary metric for physical comfort.”

Furthermore, to promote gait naturalness—a subjective aspect of comfort—we incorporate a penalty p_pdfor deviations from the reference human joint angles.$\:{p}_{pd}$ constraint is used in the PD controller, so the agent tries to imitate the angle of the robot’s legs. With the reward function, the agent can produce many normal gaits. $\:{p}_{effort}$ is a punishment for strong moments to avoid many strong moments. The reward rules for each item of the reward function are designed according to the characteristics of human walking, so that the final training result has a certain imitation of human nature. Once the appropriate environment and rewards are established, the agents can be trained.

The training process was conducted using the MATLAB DRL Toolbox. The agent was trained for a maximum of 10,000 episodes. The training was set to stop early if the average reward over 100 consecutive episodes exceeded a threshold of 250 or if the maximum number of episodes was reached, which served as the convergence criterion. Key hyperparameters include: a learning rate of 1e-4 for both actor and critic networks, a discount factor of 0.99, a replay buffer size of 1e6, and a mini-batch size of 128.

Experiment and simulation results

Establishment of experimental environment

The Nokov motion capture system was used for gait data collection. Reflective markers were placed on key anatomical landmarks of the subject according to a conventional gait model. The subject walked at a self-selected comfortable speed on a level walkway. The motion capture system synchronized with a floor-embedded 3D force plate to record ground reaction forces simultaneously. The scenario is illustrated in Fig. 6.

By utilizing an AD acquisition card and a 3D force measurement platform, the system can also synchronize the measurement of the changes in plantar force during human walking.This optical capture system provides high-accuracy kinematic data, which is crucial for reliable modeling and controller design, distinguishing it from vision-based estimation methods that may suffer from simplification errors.The scenario for gait data acquisition is shown in Fig. 6. The required human gait data in this study includes the human trunk, hip joints, knee joints, ankle joints, foot surface, and the ground reaction forces during walking. The lower limb rehabilitation robot used for training is shown in Fig. 7.

All experiments were performed in accordance with the Declaration of Helsinki and relevant guidelines. The study protocol was approved by the Institutional Review Board of Zhongyuan University of Technology. Informed consent was obtained from all participants prior to data collection.

Experimental results

In the DDPG algorithm, the evolution of rewards often exhibits fluctuations. Some factors that yield the highest rewards do not always lead to the expected gait in the most stable manner. However, the desired outcomes only occur when the average reward is high. During the training, the robot walks with a normal gait for a duration of 10 s. Therefore, it can be assumed that the robot can continue walking without falling. The training results are illustrated in Fig. 8.

The joint angles of the robot, namely θ₁, θ₂, and θ₃, exhibit periodic variations within the ranges of − 0.3 rad to 0.3 rad, − 0.3 rad to 0.4 rad, and 0 rad to 0.1 rad, respectively, showing some similarities with human joint angle variations. Additionally, during the walking process, the robot’s hip joint and foot sole exhibit small periodic changes along the Z-axis, resembling human walking characteristics.

From the perspective of limit cycles, the limit cycles of each joint angle form closed trajectories, representing the overall stability of the entire robot system. Therefore, applying DRL in the later stages of rehabilitation training can capture the characteristics of human walking, making the trained effects more human-like and achieving better rehabilitation outcomes.

The gait trajectory of the same subject was used as the reference trajectory. The performance indicators for joint angle tracking are shown in Fig. 9a,b. The results demonstrate that the hip and knee joints of the robot successfully track the desired trajectories. It performs well in all performance indicators, exhibiting good transient response characteristics, damping characteristics, and tracking performance. However, the hip joint shows slightly poorer transient response characteristics and damping properties.

Moreover, The experimental results of our proposed DRL-PD controller are presented in Fig. 9a,b.The graphs demonstrate that the hip and knee joints successfully track the desired trajectories with high accuracy. Notably, the tracking errors converge to less than 0.1 rad rapidly and maintain stability, showing excellent transient response and damping characteristics. It can also be observed that the joint torques of the hip and knee joints remain within stable ranges, specifically in the ranges of $\left( {0\sim 50{\rm{N}}\,{\rm{m}}} \right)$, $\:\left(-10\text{N}\sim60\text{N}\,\text{m}\right)$. The tracking performance was evaluated over 10 stable gait cycles. The mean and standard deviation of the RMSE for the hip joint are 0.028 ± 0.003 radand for the knee joint are 0.035 ± 0.004 rad. The small standard deviations indicate the controller’s consistent and reliable performance across multiple cycles.

On the basis of the aforementioned results, a comparative analysis can be conducted between the performance of the proposed controller and that of existing control methods. Although Fig. 9 only presents the experimental data of the method proposed herein, the key performance indicators derived from the results—including a steady-state error of less than 0.1 rad, fast convergence rate, and smooth torque output profiles (e.g., the amplitude of the hip joint torque remains stable within the range of 0–50 N·m)—can be qualitatively compared with the performance of traditional controllers reported in existing literature.

For instance, relevant studies on standard PD controllers⁹ have indicated that such controllers tend to suffer from larger oscillation amplitude and slower convergence rate when dealing with the nonlinear dynamic characteristics of human-robot systems. In contrast, while fuzzy controllers¹¹ possess a certain degree of adaptability, their control performance heavily relies on pre-defined rule bases. As a result, they cannot achieve the same control accuracy as the data-driven DRL-PD controller proposed in this paper in complex human-like trajectory tracking tasks.

In addition, To provide a more granular metric of smoothness, we calculate the root mean square (RMS) of the joint torque derivatives for our method and compare it with values representative of a standard PD controller⁹ applied to the same task. A lower RMS value indicates smoother torque transitions. The results, summarized in Table 2 below, show a significant reduction in torque variability for the DRL-PD controller, strongly supporting the claim of enhanced comfort.

Table 2 Statistical performance of the controller across all subjects and trials.

Full size table

The results in Table 2 lead to two key conclusions: First, the proposed DRL-PD controller significantly outperforms the standard PD controller, reducing tracking error by approximately 60%. More importantly, the exceptionally small standard deviations observed for each subject and across all trials provide strong statistical evidence for the controller’s consistency and its robustness to inter-subject variations.

An initial sensitivity analysis was conducted to assess the controller’s robustness to model uncertainties. By introducing a ± 5% variation in the link masses of the dynamic model, the tracking performance was re-evaluated. The results showed that the RMSE of the hip joint increased by less than 8%, and the knee joint by less than 10%, while the system remained stable. This indicates that the proposed controller possesses a degree of robustness against parameter variations, which is important for practical applications where model parameters may not be perfectly known.

Regarding computational efficiency, the training process for the DRL agent was computationally intensive, requiring approximately 12 h on a PC with an NVIDIA GeForce RTX 3060 GPU. This represents a current limitation for real-time deployment. However, once trained, the policy network can be executed efficiently. Future work will focus on policy distillation and model compression techniques to achieve real-time control on embedded systems with limited computational resources.

Conclusions and outlook

This paper designs a strategy based on real human gait data, aiming to improve the humanoid and intelligent capabilities of the robot, and optimizes the controllers. Firstly, to obtain human gait data, a 3D motion capture system is employed, and a gait data collection space is set up. Gait data of the subjects are collected and subjected to filtering and computation processes to obtain joint angles and variations in plantar forces.After that, an expected trajectory for rehabilitation training is fitted. This provides data references for designing the controllers, enhancing their humanoid characteristics, and supporting DRL. Lastly, considering the diversity of human gait, pre-planned gait trajectories may not accommodate all requirements. Therefore, a control strategy combining DRL with PD control is proposed. This strategy adheres to the concept of humanoid robot control and establishes a DRL framework for training the robot based on human gait characteristics. The acquired gait motion data is thoroughly analyzed to extract human gait features. The DDPG algorithm is employed to establish the DRL framework for continuous and high-dimensional, strongly coupled robot systems. The controller’s parameters are optimized, and simulation experiments are conducted to validate the effectiveness of the algorithm, achieving a certain degree of humanoid behavior, and ensuring the safety of robot motion.The proposed method enhances comfort by generating smoother interaction torques, as quantified and demonstrated in the results Nevertheless, with the help of an effective reward function, the trained agent will become powerful and exhibit human-like lower limb motion characteristics. After training, the robot’s gait closely resembles human gait, and most importantly, it can imitate the diversity of human walking.

The enhanced comfort and human-like gait generated by our strategy are expected to improve patient engagement and reduce training anxiety, which are crucial for rehabilitation efficacy. This work provides a foundational step toward personalized, adaptive robot-assisted therapy.

This study has limitations that point to clear future research directions.The simulation-based validation calls for long-term safety and effectiveness studies in real-world settings. Furthermore, addressing real-world implementation challenges, such as integration with clinical protocols and computational efficiency, will be critical for practical adoption. Future work will focus on these areas to enhance the strategy’s scalability and clinical impact.

Data availability

All data generated or analysed during this study are included in this published article and its supplementary information files.

References

Park, J. H. The effects of robot-assisted left-hand training on hemispatial neglect in older patients with chronic stroke: A pilot and randomized controlled trial. Medicine 100 (9), e24781 (2021).
Article PubMed PubMed Central Google Scholar
Weber, L. M. & Stein, J. The use of robots in stroke rehabilitation: A narrative review. Neuro Rehabil. 43 (1), 99–110 (2018).
Google Scholar
Lin, L. & Ma, L. The effect of lower limb rehabilitation robot training on balance and walking function in stroke patients with hemiplegia. Chin. J. Mod. Drug Appl. 15 (19), 228–231 (2021).
Google Scholar
Yun, Y., Zhen, X. & Ma, Y. A review of the effects of lower limb assistive devices on walking ability after stroke. World Latest Med. Inform. 84, 24–25 (2019).
Google Scholar
Li, G., Si, G. & Xu, F. Research progress on control strategy of lower limb exoskeleton robot. Chin. J. Rehabil. Med. 33 (12), 1488–1494 (2018).
Google Scholar
Tu, Y. et al. Adaptive admittance control of human-computer interaction force for lower limb exoskeleton rehabilitation robot. J. Xi’an Jiaotong Univ. 53 (06), 9–16 (2019).
Google Scholar
Shi, D. et al. Human-centred adaptive control of lower limb rehabilitation robot based on human–robot interaction dynamic model. Mech. Mach. Theory. 162, 104340 (2021).
Article Google Scholar
Aliabadi, M., Mashayekhifard, J. & Mohazabi, B. Intelligent and classic control of rehabilitation robot with robust Pid and fuzzy methods. Majlesi J. Mechatron. Syst. 9 (1), 31–36 (2020).
Google Scholar
Ma, Y., Wu, X. & Yi, J. A review on Human-Exoskeleton coordination towards lower limb robotic exoskeleton systems. Int. J. Rob. Autom. 34, 431–451 (2019).
Google Scholar
Yang, P. et al. Disturbance observer-based terminal sliding mode control of a 5-DOF upper-limb exoskeleton robot. IEEE Access. 7, 62833–62839 (2019).
Article Google Scholar
Rezage, G. A. & Tokhi, M. O. Fuzzy PID control of lower limb exoskeleton for elderly mobility. In 2016 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR) 1–6. (2016).
Yang, S. et al. An optimal fuzzy-theoretic setting of adaptive robust control design for a lower limb exoskeleton robot system. Mech. Syst. Signal Process. 141, 106706 (2020).
Article Google Scholar
Zhang, P., Zhang, J. & Zhang, Z. Design of RBFNN-based adaptive sliding mode control strategy for active rehabilitation robot. IEEE Access. 8, 155538–155547 (2020).
Article Google Scholar
Hu, N., Wang, A. & Wu, Y. Robust adaptive PD-like control of lower limb rehabilitation robot based on human movement data. PeerJ Comput. Sci. 7 (6), e394 (2021).
Article PubMed PubMed Central Google Scholar
Zhang, Q. et al. Adaptive sliding mode neural network control and flexible vibration suppression of a flexible Spatial parallel robot. Electronics 10 (2), 212 (2021).
Article Google Scholar
Sun, Z. et al. A novel RBF neural network-based iterative learning control for lower limb rehabilitation robot with strong robustness. In 2019 Chinese Control Conference (CCC), 4454–4459 (2019).
Chen, C. Research on Lower Limb Rehabilitation Exoskeleton Robot. D. China University of Petroleum(Beijing) (2020).
Diao, Y., Wei, H. & Yang, E. Application of plane expansion of Lagrange equation in mechanism dynamics. J. Harbin Eng. Univ. 20 (3), 91–96 (1999).
Google Scholar
Botvinick, M. et al. Deep reinforcement learning and its neuroscientific implications. Neuron 107 (4), 603–616 (2020).
Article CAS PubMed Google Scholar
Lillicrap, T. P. et al. Continuous control with deep reinforcement learning. Comput. Sci. 8 (6), A187 (2015).
Google Scholar
Xiong, H. & Diao, X. M. Safety robustness of reinforcement learning policies: A view from robust control. Neurocomputing 422, 12–21 (2021).
Article Google Scholar
Wen, Y. et al. Online reinforcement learning control for the personalization of a robotic knee prosthesis. IEEE Trans. Cybern. 50 (6), 2346–2356 (2019).
Article PubMed Google Scholar
Young, M. T. et al. Distributed bayesian optimization of deep reinforcement learning algorithms. J. Parallel Distrib. Comput. 139, 43–52 (2020).
Article Google Scholar
Luo, S. et al. Robust walking control of a lower limb rehabilitation exoskeleton coupled with a musculoskeletal model via deep reinforcement learning J. Neuroeng. Rehabil. 20 (1), 34 (2023).
Article PubMed PubMed Central Google Scholar
Malaek, S. M. & Abbasi, A. Near-optimal terrain collision avoidance trajectories using elevation maps. IEEE Trans. Aerosp. Electron. Syst. 47 (4), 2490–2501 (2011).
Article ADS Google Scholar
Sadeghi, M. et al. Numerical and experimental study of a wearable exo-glove for telerehabilitation application using shape memory alloy actuators[C]//Actuators. MDPI 13 (10), 409 (2024).
Google Scholar
Niestanak, V. D., Moshaii, A. A. & Moghaddam, M. M. A new underactuated mechanism of hand tendon injury rehabilitation. In 2017 5th RSI International Conference on Robotics and Mechatronics (ICRoM), 400–405 (IEEE, 2017).

Download references

Funding

This work was supported by: The Key RD Special Project of Henan Province (Nos.231111221600, 251111220900); The Science and Technology Innovation Teams in Higher Education Institutions of Henan Province (No.24IRTSTHN024); The Science and Technology Program Project of Henan Province (No.252102220006); The Foreign Expert Project (No.H20240988); The Science Technology Project of CNTAC (No.2025056).

Author information

Yingrui Jin and Jiahao Zhang contributed equally to this work.

Authors and Affiliations

School of Zhongyuan-Petersburg Aviation, Zhongyuan University of Technology, Zhengzhou, 450007, China
Yingrui Jin, Jun Yu, Zhaoyang Wang & Shuyi Sun
School of Electronic and Information Engineering, Zhongyuan University of Technology, Zhengzhou, 450007, China
Wei Li
School of Automation and Electrical Engineering, Zhongyuan University of Technology, Zhengzhou, 450007, China
Jiahao Zhang

Authors

Yingrui Jin
View author publications
Search author on:PubMed Google Scholar
Jiahao Zhang
View author publications
Search author on:PubMed Google Scholar
Wei Li
View author publications
Search author on:PubMed Google Scholar
Jun Yu
View author publications
Search author on:PubMed Google Scholar
Zhaoyang Wang
View author publications
Search author on:PubMed Google Scholar
Shuyi Sun
View author publications
Search author on:PubMed Google Scholar

Contributions

Y.J., J.Z., W.L., J.Y., and Zh.W. contributed to conception and design of the study, methodology, Y.J.,and J.Z.; simulation, W.L.; validation, J.Y., and Sh.S.; formal analysis, Y.J.,and J.Z., and W.L.; investigation, J.Y.; resources, W.L.; data curation, Zh.W.; writing (original draft preparation), J.Y.; writing (review and editing), J.Y.,and J.Z.; visualization, Zh.W.; supervision, J.Y.; project administration, J.Y.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Yingrui Jin or Jun Yu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Institutional review board statement

The study protocol was approved by the Institutional Review Board of Zhongyuan University of Technology.

Informed consent

Informed consent was obtained from all participants prior to data collection.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Jin, Y., Zhang, J., Li, W. et al. A humanoid control strategy based on deep reinforcement learning for enhanced comfort in lower limb rehabilitation robots. Sci Rep 16, 7370 (2026). https://doi.org/10.1038/s41598-026-39011-7

Download citation

Received: 25 August 2025
Accepted: 02 February 2026
Published: 05 February 2026
Version of record: 20 February 2026
DOI: https://doi.org/10.1038/s41598-026-39011-7

A humanoid control strategy based on deep reinforcement learning for enhanced comfort in lower limb rehabilitation robots

Subjects

Abstract

Similar content being viewed by others

Bioinspired morphology and task curricula for learning locomotion in bipedal muscle-actuated systems

Lower-extremity inter-joint coordination variability in active individuals with transtibial amputation and healthy males during gait

Experiment-free exoskeleton assistance via learning in simulation

Introduction

The modeling of lower limb rehabilitation robot

The kinematic modeling of lower limb rehabilitation robot

Dynamic modeling based on Lagrangian

Design of model-based deep reinforcement learning PD controller

Deep reinforcement learning algorithms

PD controller design based on deep reinforcement learning

Experiment and simulation results

Establishment of experimental environment

Experimental results

Conclusions and outlook

Data availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Institutional review board statement

Informed consent

Additional information

Publisher’s note

Supplementary Information

Supplementary Material 1

Rights and permissions

About this article

Cite this article

Keywords

Search

Quick links

Subjects

Abstract

Similar content being viewed by others

Bioinspired morphology and task curricula for learning locomotion in bipedal muscle-actuated systems

Lower-extremity inter-joint coordination variability in active individuals with transtibial amputation and healthy males during gait

Experiment-free exoskeleton assistance via learning in simulation

Introduction

The modeling of lower limb rehabilitation robot

The kinematic modeling of lower limb rehabilitation robot

Dynamic modeling based on Lagrangian

Design of model-based deep reinforcement learning PD controller

Deep reinforcement learning algorithms

PD controller design based on deep reinforcement learning

Experiment and simulation results

Establishment of experimental environment

Experimental results

Conclusions and outlook

Data availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Institutional review board statement

Informed consent

Additional information

Publisher’s note

Supplementary Information

Supplementary Material 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links