Introduction

Recently, the number of patients suffering from lower limb functional impairments, caused by conditions like apoplexy and spinal cord injury, has been rapidly on the rise, particularly among the elderly population. Scholars have confirmed that robot-assisted rehabilitation has positive effects on the recovery of patients’ motor ability1,2,3,4. The interactive control, which is based on force information, is achieved by incorporating the recovery patient as a component of the control system5. The methods of combining control technology with artificial intelligence have emerged as research challenges for rehabilitation robots, attracting considerable attention from numerous scholars6,7,8.

Rehabilitation robot control is a nonlinear control. Several researchers have suggested different control strategies, including PD control9, sliding mode control10, fuzzy control11, etc. The dynamics of rehabilitation robots exhibit highly nonlinear characteristics, which pose challenges in establishing accurate mathematical models. Hu et al. used a three-dimensional motion capture system to collect human gait data and designed a robust adaptive class PD controller12. The PD control, as a classical control strategy for robot control13-14, is widely used due to its simplicity. It is often combined with other controllers to achieve the desired control effect. Reinforcement learning is a machine learning algorithm15. Firstly, an agent evaluates the value function of the current policy it adopts. Secondly, based on the evaluation of the value function, the agent updates its policy. The agent acquires the optimal policy through continuous interaction with the environment16. In the context of rehabilitation robot research, combining DRL g with traditional control can enhance the intelligence level of the control system. Therefore, we adopted this control strategy in this paper.

Recently, DRL has emerged as a powerful framework for solving complex control problems in robotics. Notably, studies such as Robust walking control.musculoskeletal model24 have coupled DRL with biomechanical models for more natural gait generation. Furthermore, advancements in wearable devices Numerical and experimental study. wearable exo-glove26 and terrain adaptation Near-optimal terrain collision avoidance25 highlight the trend towards intelligent and adaptive rehabilitation systems. Our work builds upon these developments by integrating DRL with traditional PD control to capture human gait characteristics directly from motion capture data, aiming to enhance comfort and adaptability.

In the dynamic modeling of the lower limb rehabilitation robots, the Lagrange formulation and the Newton-Euler formulation were used commonly17. The Newton–Euler formulation not only requires solving the mechanical equations of the links but also involves calculating the interaction forces between the links, which increases the computational difficulty and modeling complexity. The Lagrange formulation analyzes robot dynamics from the perspective of energy variations, establishing appropriate generalized coordinate systems for the robot and considering the kinetic and potential energies to determine its motion and interactions18. The Lagrange formulation avoids the analysis of internal forces within the entire robot and is suitable for analyzing human lower limb motion from the perspective of energy metabolism. This approach has been adopted in this paper.

This study utilizes a control strategy combining DRL with PD control. Leveraging a substantial dataset of 100 gait cycles from five subjects, we demonstrate that the strategy exhibits not only good effectiveness but also robust performance across diverse gait characteristics.This paper is organized as follows: In “The modeling of lower limb rehabilitation robot”, the modeling of the lower limb rehabilitation robot is introduced. The Lagrange formulation has been adopted to analyze the human lower limb motion. In “Design of model-based deep reinforcement learning PD controller”, the PD Controller based on DRL is designed. The experiment and simulation results are given in “Experiment and simulation results”. Finally, the conclusions are drawn in “Conclusions and outlook”.

The modeling of lower limb rehabilitation robot

The kinematic modeling of lower limb rehabilitation robot

For DRL, we need to build a suitable robot model. In this paper, a three-link two-dimensional robot model (as shown in Fig. 1) was built, and its motion equation was solved. The motion equation is established by using the energy constraints of the model, which is suitable for describing the robot’s walking.

Fig. 1
Fig. 1The alternative text for this image may have been generated using AI.
Full size image

A three-link two-dimensional robot model.

In this model, the joint angles are defined as follows: \(\:{\text{q}}_{1}\) represents the knee joint angle,\(\:\:{\text{q}}_{2}\) represents the hip joint angle, and \(\:{\text{q}}_{3}\) represents the torso inclination angle.

Among the equations, \(\:l\) represents the length of each limb of the robot, \(\:m\) is the weight of each limb of the robot, \(\:{q}_{1},{q}_{2},{q}_{3}\) are the joint angles respectively. Simplify multiple joints of the lower limbs into a two-dimensional model of a three-link, which greatly reduces computational cost and difficulty.

To obtain gait data, the reflex marker was fixed to the outside of the subject’s joints, and the subject was required to walk freely in the built-up 3D motion capture space. Then record the coordinate information of the reflection marks of each joint and the data of the ground reaction force. Then you can calculate the angle change of each joint, that is as the Eqs. (1)–(3):

$$\:{q}_{1}={\text{tan}}^{-1}\left(\frac{ank\_R\_x-hip\_x}{hip\_z-ank\_R\_z}\right)$$
(1)
$$\:{q}_{2}={\text{tan}}^{-1}\left(\frac{hip\_x-ank\_L\_x}{hip\_z-ank\_L\_z}\right)$$
(2)
$$\:{q}_{3}={\text{tan}}^{-1}\left(\frac{head\_x-hip\_x}{head\_z-hip\_z}\right)(ank\_R\_x,ank\_R\_z)$$
(3)

Among them, \(\:\left(ank\_L\_x,ank\_L\_z\right)\), \(\:\left(ank\_R\_x,ank\_R\_z\right)\), \(\:\left(hip\_x,hip\_z\right)\), and \(\:\left(head\_x,head\_z\right)\) are the coordinate of ankle joint, hip joint, and torso respectively.

This study is based on gait data obtained from five healthy subjects. The key kinematic and spatiotemporal parameters were analyzed for the entire dataset. To exemplify the specific parameters extracted and their typical ranges, the data from one subject (67.5 kg, 1.68 m) are detailed in Table 1.

Table 1 Some walking characteristics of the subjects.

According to the above formula, the angle of \(\:{q}_{1}, {q}_{2}\) and \(\:{q}_{3}\) changed. The angles of \(\:{q}_{1}\) and \(\:{q}_{2}\) changed into obvious periodicity and the symmetry of alternating left and right legs, and all change within a certain range. However, there is no obvious periodicity in the angle change of \(\:{q}_{3}\) but it is only a smooth change in a small range.

For the spatial position of the joints, we are interested in the height changes of the hip and ankle joints (as shown in Fig. 2). The height of the hip joint and ankle joint both show periodic reciprocating movement within a certain range.

Fig. 2
Fig. 2The alternative text for this image may have been generated using AI.
Full size image

The height changes of the hip and ankle joints.

Figure 2 depicts the vertical displacement of the hip and ankle joints (derived from motion capture data of healthy participants) over a single gait cycle.

The three-link model is used for gait data fitting and simulation validation to better approximate the human-robot system. The theoretical derivation based on a two-link model simplifies the problem while capturing the core dynamics for controller design and stability analysis.

Dynamic modeling based on Lagrangian

In the dynamic modeling of the lower limb rehabilitation robots, the Lagrange formulation has been adopted in this paper to analyze the human lower limb motion from the perspective of energy metabolism.

The form of the Lagrange function is as Eq. (4),

$$\:L\:=\:E_k\:-\:E_p$$
(4)

where \(\:{E}_{k}\) is the total kinetic energy and, \(\:{E}_{p}\) is the total potential energy.

The Lagrange equation is,

$$\:\tau\:\:=\frac{d}{dt}\frac{\partial\:l}{\partial\:\dot{q}}-\frac{\partial\:l}{\partial\:q}$$
(5)

Connecting rod speed \(\:{v}_{d1},{v}_{d2}\), The kinetic energy of the thigh and calf,

$$\:{E}_{k1}=\frac{1}{2}{m}_{1}{v}_{d1}^{2},{E}_{k2}=\frac{1}{2}{m}_{2}{v}_{d2}^{2}$$
(6)

Then, the total kinetic energy of the two-link model is,

$$\:{E}_{k}={E}_{k1}+{E}_{k2}=\frac{1}{2}{m}_{2}({l}_{1}{\dot{q}}_{hip}{)}^{2}+\frac{1}{2}{m}_{2}{l}_{1}{l}_{2}{\dot{q}}_{hip}({\dot{q}}_{hip}-{\dot{q}}_{knee}){cos}{q}_{knee}+\frac{1}{8}{m}_{1}({l}_{1}{\dot{q}}_{hip}{)}^{2}+\frac{1}{8}{m}_{2}\left({l}_{2}\right({\dot{q}}_{hip}-{\dot{q}}_{knee}){)}^{2}$$
(7)

The potential energies of the thigh and calf are, respectively,

$$\:{E}_{p1}={m}_{1}g{y}_{d1},{E}_{p2}={m}_{2}g{y}_{d2}$$
(8)

Then, the total potential energy of the two-link model is,

$$\:{E}_{p}={E}_{p1}+{E}_{p2}={m}_{2}g{l}_{1}{cos}{q}_{hip}+\frac{1}{2}{m}_{1}g{l}_{1}{cos}{q}_{hip}+\frac{1}{2}{m}_{2}g{l}_{2}{cos}({q}_{hip}-{q}_{knee})$$
(9)

Then establish the Lagrange equation,

$$\:{\tau\:}_{hip}=\frac{d}{dt}\frac{\partial\:l}{\partial\:{\dot{q}}_{hip}}-\frac{\partial\:l}{\partial\:{q}_{hip}}$$
(10)
$$\:{\tau\:}_{knee}=\frac{d}{dt}\frac{\partial\:l}{\partial\:{\dot{q}}_{knee}}-\frac{\partial\:l}{\partial\:{q}_{knee}}$$
(11)

Here, Eqs. (10) and (11) are the specific applications of the general Lagrange Eq. (5) to the hip and knee joints, respectively. They describe how the joint torques \(\:{\tau\:}_{hip}\:\)and \(\:{\tau\:}_{knee}\) are derived from the partial derivatives of the Lagrangian L(kinetic energy minus potential energy) with respect to the joint angles qand their velocities dq.

By finishing Eqs. (7), (9), (10) and (11) the dynamic equation of the two-link model can be obtained,

$$\:\varvec{M}\left(\varvec{q}\right)\ddot{q}+\varvec{C}(\varvec{q},\dot{\varvec{q}})\dot{q}+\varvec{G}\left(\varvec{q}\right)=\varvec{\tau\:}$$
(12)

where \(\:\varvec{M}\left(\varvec{q}\right)\) is the inertia matrix, \(\:\varvec{C}(\varvec{q},\dot{\varvec{q}})\) is the centripetal force matrix, \(\:\varvec{G}\left(\varvec{q}\right)\) is the gravity matrix, \(\:\varvec{\tau\:}\) is the driving torque of the robot.

To balance the complexity of theoretical derivation and the fidelity of simulation verification, this study adopts models with varying levels of complexity. In the dynamic modeling based on the Lagrangian method, a two-link model is employed for derivation. This model features a simple structure, which can clearly demonstrate the energy relationship of the system and facilitate the stability analysis of the controller (Eq. 12).

In the simulation verification presented in “Experiment and simulation results”, to better approximate the real human gait, the three-link model described in “The kinematic modeling of lower limb rehabilitation robot” (Fig. 1) was adopted. There is a clear correspondence between the two models in terms of joint variables: the hip and knee joints in the two-link model correspond to the variable combinations o\(\:{q}_{2}\) and \(\:{q}_{1}\) in the three-link model, respectively. This correspondence ensures the coherence and consistency from dynamic theory to simulation experiments.

Design of model-based deep reinforcement learning PD controller

In this work, the Deep Deterministic Policy Gradient (DDPG) algorithm is adopted due to its suitability for continuous control tasks.As part of machine learning, Reinforcement learning is an optimal strategy for learning \(\:\pi\:\) Maximize cumulative rewards. i.e. for time \(\:t\), the agent receiving the observed value \(\:{O}_{t}\) of the state \(\:{S}_{t}\), simultaneous strategy \(\:\pi\:\left({a}_{t}|{s}_{t}\right)\) gives an action \(\:{a}_{t}\) in response to an observation and earns rewards \(\:{r}_{t}=\left({s}_{t},{a}_{t},{s}_{t+1}\right)\) and a new state \(\:{s}_{t+1}\), then repeats this process in order to get the largest cumulative reward. Figure 3 shows a general model for reinforcement learning.

Fig. 3
Fig. 3The alternative text for this image may have been generated using AI.
Full size image

A general model for reinforcement learning.

Deep reinforcement learning algorithms

DRL aims to build a comprehensive framework for the interplay between learning, representation, and decision making19. At the same time, it also circumvents the limitations of traditional reinforcement learning. DRL enables agents to use neural networks to represent policies \(\:{\pi\:}_{\theta\:}\left(s\right)\) and to make decisions from high-dimensional and unstructured input data. Among them, deep Q-learning directly uses the original high-dimensional data for reinforcement learning, and at the same time uses stochastic gradient descent for training, which ensures the stability of the training process and has good convergence characteristics. Moreover, it has stronger sample efficiency in high-dimensional state-action space, and can learn random strategies, which improves the robustness of the algorithm. In the Deep Q-Network (DQN) algorithm, the optimization metric function is defined by \(\:J\left(\theta\:\right)\), make the parameter is adjusted towards j the direction of maximization, and finally the optimal strategy is given. The formula for Markov decision process (MDP) is shown in Eqs. (13) and (14). Although it is easier to converge to a local optimum, it is accompanied by a larger variance.

$$J(\theta )={E_{{\pi _\theta }}}[r]=\sum\limits_{{s \in S}} {d(s)} \sum\limits_{{a \in A}} {{\pi _\theta }(s,a)} {R_{s,a}}$$
(13)
$${\Delta _\theta }J(\theta )=\sum\limits_{{s \in S}} {d(s)} \sum\limits_{{a \in A}} {{\pi _\theta }(s,a){\Delta _\theta }\log {\pi _\theta }(s,a){R_{s,a}}}$$
(14)

Among the Eq. (13), \(\:d\left(s\right)\) is the probability distribution of random state \(\:a\), \(\:{R}_{s,a}\) is the reward at state\(\:\:a\) after acting \(\:s\). The gradient of \(\:J\) can be obtained using the strategy gradient method of Eq. (14), and the value function approximation technique can directly approximate \(\:{Q}^{{\pi\:}_{\theta\:}}\left(s,a\right)\), as shown in Eq. (15).

$${\Delta _\theta }{\pi _\theta }(s,a)={\pi _\theta }(s,a)\frac{{{\Delta _\theta }{\pi _\theta }(s,a)}}{{{\pi _\theta }(s,a)}}={\Delta _\theta }\log {\pi _\theta }(s,a)$$
(15)
$${\Delta _\theta }J(w)={E_{{\pi _w}}}[{\Delta _w}\log {\pi _w}(s,a){Q^{{\pi _w}}}(s,a)]$$
(16)

A robot is a continuous, nonlinear, and strongly coupled system. The deep deterministic policy gradient (DDPG) algorithm20 uses an actor-critic architecture that can handle off-policy data. The DDPG algorithm includes Actor network and Critic network. Among them, the Critic network is used to approximate, and the Actor network is used to approximate \(\:{\pi\:}_{\theta\:}\). Specifically, Actor maps the state of the system to actions, and Critic updates this mapping to generate an increase \(\:Q\)value action. The network uses the approximation capabilities of neural networks to update the parameters of the action-value function \(\:w\), Parameters of the actor network update strategy \(\:\theta\:\). However, the Actor-Critic method produces approximation errors when using neural network approximation, and its approximate policy gradient is as Eqs. (17)–(19),

$$\:{Q}_{w}(s,a)\approx\:{Q}^{{\pi\:}_{\theta\:}}(s,a)$$
(17)
$$\:{\varDelta\:}_{\theta\:}J\left(\theta\:\right)\approx\:{E}_{{\pi\:}_{\theta\:}}\left[{\varDelta\:}_{\theta\:}{log}{\pi\:}_{\theta\:}\right(s,a\left){Q}_{w}\right(s,a\left)\right]$$
(18)
$$\:\varDelta\:\theta\:=\alpha\:{\varDelta\:}_{\theta\:}{log}{\pi\:}_{\theta\:}(s,a){Q}_{w}(s,a)$$
(19)

The DDPG does not use recent data to update the policy but records the data in a buffer for later use when updating the policy. This algorithm does not need any environment transition model to find the optimal policy, it only needs (state, action, next state, reward) tuples. So better tuning strategies can be achieved.

PD controller design based on deep reinforcement learning

The control strategy proposed in this study adopts a collaborative architecture integrating a deep reinforcement learning (DRL) agent and a proportional-derivative (PD) controller. As illustrated in Fig. 4, within the overall system framework, the DRL agent acts as the upper-layer decision-maker, which is responsible for learning human gait characteristics and generating adaptive control actions; the PD controller functions as the lower-layer tracker, ensuring the joint angles accurately track the reference trajectories.

Based on the established model and the derived Eq. (12), consider the total friction of the robot, the kinetic equation of the model is:

$$\:\varvec{M}\left(\varvec{q}\right)\ddot{q}+\varvec{C}(\varvec{q},\dot{\varvec{q}})\dot{q}+\varvec{G}\left(\varvec{q}\right)+{F}_{v}=\tau\:=\varvec{B}\varvec{u}$$
(20)

Among them: \(\:\varvec{M}\left(\varvec{q}\right)\) is the positive definite inertia matrix, \(\:\varvec{C}(\varvec{q},\dot{\varvec{q}})\) is the matrix of the Coriolis force and centrifugal force, \(\:\varvec{G}\left(\varvec{q}\right)={\left[{G}_{11},{G}_{12},{G}_{13}\right]}^{T}\) is the matrix of the Gravity, \(\:{F}_{v}\) is the total friction of the robot, \(\:\tau\:\) is the torque, \(\:\varvec{u}\) is the driving torque, \(\:\varvec{B}\) is the transform matrix.

$$\:\tau\:={\varvec{k}}_{\varvec{p}}e+{\varvec{k}}_{\varvec{d}}\dot{e}$$
(21)

Among them: \(\:{\varvec{k}}_{\varvec{p}},{\varvec{k}}_{\varvec{d}}\) are both the positive symmetry matrix, \(\:\varvec{e}={\varvec{q}}_{\varvec{d}}-\varvec{q}\), \(\:{\varvec{q}}_{\varvec{d}}\) is the desired joint angle. Combined Eqs. (20) and (21), Eq. (22) can be obtained:

$$\:\varvec{M}\left(\varvec{q}\right)({\ddot{q}}_{d}-\ddot{q})+\varvec{C}(\varvec{q},\dot{\varvec{q}})({\dot{q}}_{d}-\dot{q})+{K}_{d}\dot{e}+{K}_{p}e=0$$
(22)

It can also be expressed as: \(\:\varvec{M}\left(\varvec{q}\right)\:\ddot{e}+\varvec{C}\left(\varvec{q},\dot{\varvec{q}}\right)\:\dot{e}+{K}_{p}e=-{K}_{d}\dot{e}\).

Consider the candidate Lyapunov function:

$$\:V=\frac{1}{2}{\dot{e}}^{T}M\left(q\right)\dot{e}+\frac{1}{2}{\dot{e}}^{T}{K}_{d}\dot{e}$$
(23)

This equation is positive definite. Its derivative of time is as the following Eq. (24):

$$\:\dot{V}={\dot{e}}^{T}M\ddot{e}+\frac{1}{2}{\dot{e}}^{T}M\left(q\right)\dot{e}+{\dot{e}}^{T}{K}_{d}e$$
(24)

There is an oblique symmetrical feature matrix which satisfy: \(\:\dot{M}\left(q\right)-2C(q,\dot{q})=0\), substitution into the Eq. (24), and obtained:

$$\:\dot{V}={\dot{e}}^{T}(M\ddot{e}+D\dot{e}+{K}_{d}e)=-{\dot{e}}^{T}{K}_{d}\dot{e}\le\:0$$
(25)

Thus, it is ensured that the control system to be designed meets the stability in the sense of Lyapunov.

This system controls the hip and knee joints of the robot, with a total of 2 Degrees of Freedom (DOF) and equipped with two actuators, thus being a fully actuated system. This provides a foundation for precise trajectory tracking and attitude control. The gait control method of the robot proposed in this section is to apply the driving torques of the two hip joints to the robot in accordance with a specific gait, thereby driving the robot to run at the desired gait. It uses two constraints, which are functions of the model coordinates (the angular trajectory of the hips, posture, and swinging feet, and the required tilt angle). To improve the stability of the robot, PD control has been introduced. The PD controller is given by the following Eq. (26):

$$\:u={\varvec{k}}_{\varvec{p}}y+{\varvec{k}}_{\varvec{d}}\dot{y}$$
(26)

Among them, the control variable \(\:y=[{y}_{1},{y}_{2}]\) is defined as \(\:{y}_{1}={q}_{hip}-{q}_{hip,d}\) and \(\:{y}_{2}={q}_{knee}-{q}_{knee,d}\), i.e., the tracking errors of the joint angles.

The DRL controller is designed with the following key components:

State Space (s): The state observed by the agent includes the joint angles and velocities, defined as s = [\(\:{q}_{hip}\), \(\:{q}_{knee}\), \(\:{dq}_{hip}\), \(\:{dq}_{knee}\)].

Action Space (a): The agent outputs the desired joint torques, a = [\(\:{u}_{hip}\), \(\:{u}_{knee}\)], which are applied to the robot model.

Reward Function (r): As defined in Eq. (27), the reward ris a composite function designed to encourage desired walking behavior while ensuring safety and comfort. Key terms include \(\:{r}_{speed}\) for velocity tracking, and penalty terms like \(\:{p}_{effort}\)(Eq. 28) for minimizing joint torques to enhance comfort.

Network Architecture: The DDPG algorithm employs an Actor-Critic architecture. Both Actor and Critic networks are fully connected neural networks with two hidden layers (256 and 128 neurons, respectively) and ReLU activation functions.

Training Details: The training was conducted using the MATLAB DRL Toolbox, as described in “PD controller design based on deep reinforcement learning”. Key hyperparameters include: a learning rate of 1e−4 for both networks, a discount factor of 0.99, a replay buffer size of 1e6, and a mini-batch size of 128. The agent was trained for up to 10,000 episodes.”

Fig. 4
Fig. 4The alternative text for this image may have been generated using AI.
Full size image

The overall system framework.

The agent is composed of actor network and critic network. In Fig. 5, we observe this system in more detail. The actor gets the current status \(\:q\), \(\:{d}_{q}\) and gives actions \(\:{u}_{1}\),\(\:\:{u}_{2}\). The critic obtains the status and the actions, then estimates the value of the reward function \(\:{Q}_{0}\) in advance. The environment represents the real physical world where the robot is trying to walk. The robot obtains the action and calculates the next state, as well as the reward function in the new state. The agent is optimized by minimize the loss function. Then, it propagates a gradient back to the critic and the actor to optimize the two networks.

Fig. 5
Fig. 5The alternative text for this image may have been generated using AI.
Full size image

The composition of the agent.

Rewards are defined through an iterative process. The reward function is given by the following Eq. (27), where \(\:r\) represents the reward value and \(\:p\) represents the penalty value.

$$\:Q={r}_{speed}-{p}_{trunk}+{r}_{survival}-{p}_{zhip}-{p}_{za}-{p}_{pd}-{p}_{effort}$$
(27)

The design of the reward function fully incorporates the safety and movement quality of rehabilitation training. Specifically,\(\:{r}_{speed}\) encourages a natural walking pace, while multiple penalty terms directly safeguard safety:\(\:{\:p}_{trunk}\) is used to maintain trunk uprightness and prevent falls; \(\:{p}_{zhip}\) stabilizes the height of the hip joints; \(\:{p}_{za}\) ensures sufficient ground clearance for the swing foot to avoid tripping; \(\:{p}_{effort}\) penalizes excessive torques, aiming to generate smooth and comfortable movements and reduce patient discomfort. The weight settings are based on the analysis of healthy gait characteristics and parameter tuning.

a key comfort-oriented penalty \(\:{p}_{effort}\) defined as the normalized squared magnitude of the joint torques:

$$\:{p}_{effort}=\eta\:\sum\:_{i=1}^{n}{\tau\:}_{i}^{2}$$
(28)

where \(\:{\uptau\:}\) is the torque at joint i, n is the number of actuated joints, and η is a weighting coefficient. Minimizing this term directly minimizes the interaction forces, which is our primary metric for physical comfort.”

Furthermore, to promote gait naturalness—a subjective aspect of comfort—we incorporate a penalty p_pdfor deviations from the reference human joint angles.\(\:{p}_{pd}\) constraint is used in the PD controller, so the agent tries to imitate the angle of the robot’s legs. With the reward function, the agent can produce many normal gaits. \(\:{p}_{effort}\) is a punishment for strong moments to avoid many strong moments. The reward rules for each item of the reward function are designed according to the characteristics of human walking, so that the final training result has a certain imitation of human nature. Once the appropriate environment and rewards are established, the agents can be trained.

The training process was conducted using the MATLAB DRL Toolbox. The agent was trained for a maximum of 10,000 episodes. The training was set to stop early if the average reward over 100 consecutive episodes exceeded a threshold of 250 or if the maximum number of episodes was reached, which served as the convergence criterion. Key hyperparameters include: a learning rate of 1e-4 for both actor and critic networks, a discount factor of 0.99, a replay buffer size of 1e6, and a mini-batch size of 128.

Experiment and simulation results

Establishment of experimental environment

The Nokov motion capture system was used for gait data collection. Reflective markers were placed on key anatomical landmarks of the subject according to a conventional gait model. The subject walked at a self-selected comfortable speed on a level walkway. The motion capture system synchronized with a floor-embedded 3D force plate to record ground reaction forces simultaneously. The scenario is illustrated in Fig. 6.

By utilizing an AD acquisition card and a 3D force measurement platform, the system can also synchronize the measurement of the changes in plantar force during human walking.This optical capture system provides high-accuracy kinematic data, which is crucial for reliable modeling and controller design, distinguishing it from vision-based estimation methods that may suffer from simplification errors.The scenario for gait data acquisition is shown in Fig. 6. The required human gait data in this study includes the human trunk, hip joints, knee joints, ankle joints, foot surface, and the ground reaction forces during walking. The lower limb rehabilitation robot used for training is shown in Fig. 7.

All experiments were performed in accordance with the Declaration of Helsinki and relevant guidelines. The study protocol was approved by the Institutional Review Board of Zhongyuan University of Technology. Informed consent was obtained from all participants prior to data collection.

Fig. 6
Fig. 6The alternative text for this image may have been generated using AI.
Full size image

The scenario for gait data acquisition.

Fig. 7
Fig. 7The alternative text for this image may have been generated using AI.
Full size image

The lower limb rehabilitation robot.

Experimental results

In the DDPG algorithm, the evolution of rewards often exhibits fluctuations. Some factors that yield the highest rewards do not always lead to the expected gait in the most stable manner. However, the desired outcomes only occur when the average reward is high. During the training, the robot walks with a normal gait for a duration of 10 s. Therefore, it can be assumed that the robot can continue walking without falling. The training results are illustrated in Fig. 8.

Fig. 8
Fig. 8The alternative text for this image may have been generated using AI.
Full size image

The angle changes of knee, hip, and ankle joints.

The joint angles of the robot, namely θ1, θ2, and θ3, exhibit periodic variations within the ranges of − 0.3 rad to 0.3 rad, − 0.3 rad to 0.4 rad, and 0 rad to 0.1 rad, respectively, showing some similarities with human joint angle variations. Additionally, during the walking process, the robot’s hip joint and foot sole exhibit small periodic changes along the Z-axis, resembling human walking characteristics.

From the perspective of limit cycles, the limit cycles of each joint angle form closed trajectories, representing the overall stability of the entire robot system. Therefore, applying DRL in the later stages of rehabilitation training can capture the characteristics of human walking, making the trained effects more human-like and achieving better rehabilitation outcomes.

The gait trajectory of the same subject was used as the reference trajectory. The performance indicators for joint angle tracking are shown in Fig. 9a,b. The results demonstrate that the hip and knee joints of the robot successfully track the desired trajectories. It performs well in all performance indicators, exhibiting good transient response characteristics, damping characteristics, and tracking performance. However, the hip joint shows slightly poorer transient response characteristics and damping properties.

Moreover, The experimental results of our proposed DRL-PD controller are presented in Fig. 9a,b.The graphs demonstrate that the hip and knee joints successfully track the desired trajectories with high accuracy. Notably, the tracking errors converge to less than 0.1 rad rapidly and maintain stability, showing excellent transient response and damping characteristics. It can also be observed that the joint torques of the hip and knee joints remain within stable ranges, specifically in the ranges of \(\left( {0\sim 50{\rm{N}}\,{\rm{m}}} \right)\), \(\:\left(-10\text{N}\sim60\text{N}\,\text{m}\right)\). The tracking performance was evaluated over 10 stable gait cycles. The mean and standard deviation of the RMSE for the hip joint are 0.028 ± 0.003 radand for the knee joint are 0.035 ± 0.004 rad. The small standard deviations indicate the controller’s consistent and reliable performance across multiple cycles.

Fig. 9
Fig. 9The alternative text for this image may have been generated using AI.
Full size image

The experimental results. (a) Tracking of joint angles. (b) Tracking of joint velocities.

On the basis of the aforementioned results, a comparative analysis can be conducted between the performance of the proposed controller and that of existing control methods. Although Fig. 9 only presents the experimental data of the method proposed herein, the key performance indicators derived from the results—including a steady-state error of less than 0.1 rad, fast convergence rate, and smooth torque output profiles (e.g., the amplitude of the hip joint torque remains stable within the range of 0–50 N·m)—can be qualitatively compared with the performance of traditional controllers reported in existing literature.

For instance, relevant studies on standard PD controllers9 have indicated that such controllers tend to suffer from larger oscillation amplitude and slower convergence rate when dealing with the nonlinear dynamic characteristics of human-robot systems. In contrast, while fuzzy controllers11 possess a certain degree of adaptability, their control performance heavily relies on pre-defined rule bases. As a result, they cannot achieve the same control accuracy as the data-driven DRL-PD controller proposed in this paper in complex human-like trajectory tracking tasks.

In addition, To provide a more granular metric of smoothness, we calculate the root mean square (RMS) of the joint torque derivatives for our method and compare it with values representative of a standard PD controller9 applied to the same task. A lower RMS value indicates smoother torque transitions. The results, summarized in Table 2 below, show a significant reduction in torque variability for the DRL-PD controller, strongly supporting the claim of enhanced comfort.

Table 2 Statistical performance of the controller across all subjects and trials.

The results in Table 2 lead to two key conclusions: First, the proposed DRL-PD controller significantly outperforms the standard PD controller, reducing tracking error by approximately 60%. More importantly, the exceptionally small standard deviations​ observed for each subject and across all trials provide strong statistical evidence for the controller’s consistency and its robustness to inter-subject variations.

An initial sensitivity analysis was conducted to assess the controller’s robustness to model uncertainties. By introducing a ± 5% variation in the link masses of the dynamic model, the tracking performance was re-evaluated. The results showed that the RMSE of the hip joint increased by less than 8%, and the knee joint by less than 10%, while the system remained stable. This indicates that the proposed controller possesses a degree of robustness against parameter variations, which is important for practical applications where model parameters may not be perfectly known.

Regarding computational efficiency, the training process for the DRL agent was computationally intensive, requiring approximately 12 h on a PC with an NVIDIA GeForce RTX 3060 GPU. This represents a current limitation for real-time deployment. However, once trained, the policy network can be executed efficiently. Future work will focus on policy distillation and model compression techniques to achieve real-time control on embedded systems with limited computational resources.

Conclusions and outlook

This paper designs a strategy based on real human gait data, aiming to improve the humanoid and intelligent capabilities of the robot, and optimizes the controllers. Firstly, to obtain human gait data, a 3D motion capture system is employed, and a gait data collection space is set up. Gait data of the subjects are collected and subjected to filtering and computation processes to obtain joint angles and variations in plantar forces.After that, an expected trajectory for rehabilitation training is fitted. This provides data references for designing the controllers, enhancing their humanoid characteristics, and supporting DRL. Lastly, considering the diversity of human gait, pre-planned gait trajectories may not accommodate all requirements. Therefore, a control strategy combining DRL with PD control is proposed. This strategy adheres to the concept of humanoid robot control and establishes a DRL framework for training the robot based on human gait characteristics. The acquired gait motion data is thoroughly analyzed to extract human gait features. The DDPG algorithm is employed to establish the DRL framework for continuous and high-dimensional, strongly coupled robot systems. The controller’s parameters are optimized, and simulation experiments are conducted to validate the effectiveness of the algorithm, achieving a certain degree of humanoid behavior, and ensuring the safety of robot motion.The proposed method enhances comfort by generating smoother interaction torques, as quantified and demonstrated in the results Nevertheless, with the help of an effective reward function, the trained agent will become powerful and exhibit human-like lower limb motion characteristics. After training, the robot’s gait closely resembles human gait, and most importantly, it can imitate the diversity of human walking.

The enhanced comfort and human-like gait generated by our strategy are expected to improve patient engagement and reduce training anxiety, which are crucial for rehabilitation efficacy. This work provides a foundational step toward personalized, adaptive robot-assisted therapy.

This study has limitations that point to clear future research directions.The simulation-based validation calls for long-term safety and effectiveness studies​ in real-world settings. Furthermore, addressing real-world implementation challenges, such as integration with clinical protocols and computational efficiency, will be critical for practical adoption. Future work will focus on these areas to enhance the strategy’s scalability and clinical impact.