The present disclosure relates to the field of exoskeleton-type robots.
More specifically, it relates to methods for training a neural network and for using said neural network to stabilize a bipedal robot such as an exoskeleton.
Recently, for persons with significant mobility problems such as paraplegics, assisted walking devices called exoskeletons have appeared, which are external robotised devices that the operator (the human user) “slips on” thanks to a system of fasteners which links the movements of the exoskeleton with his own movements. The exoskeletons of lower limbs have several joints, generally at least at the level of the knees and the hips, to reproduce the walking movement. Actuators make it possible to move these joints, which in their turn make the operator move. An interface system allows the operator to give orders to the exoskeleton, and a control system transforms these orders into commands for the actuators. Sensors generally complete the device.
These exoskeletons constitute an advance compared to wheelchairs, because they allow operators to get back on their feet and to walk. Exoskeletons are no longer limited by wheels and can theoretically move about in the majority of non-flat environments: wheels, unlike legs, do not make it possible to clear significant obstacles such as steps, stairs, obstacles that are too high, etc.
However, achieving dynamic stability for such exoskeletons is still a challenge. Continuous feedback control is required to keep balance, since the vertical posture is inherently unstable. Trajectory planning for bipedal robots has been solved successfully through whole-body optimization, and stable walking on essentially flat ground and without major disturbances was achieved on the exoskeleton Atalante of the present applicant by using advanced traditional planning approaches, see for instance the patent applications EP3568266 and FR3101463, or the document T. Gurriet, S. Finet, G. Boeris, A. Duburcq, A. Hereid, O. Harib, M. Masselin, J. Grizzle, and A. D. Ames, “Towards restoring locomotion for paraplegics: Realizing dynamically stable walking on exoskeletons,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2804-2811.
Yet, so-called “emergency recovery” is still an open problem. In more detail, while the exoskeleton is perfectly able to keep a smooth and stable trajectory while walking, it may be thrown off balance if it undergoes a strong and unexpected disturbance such as a push, which could result in the human operator being hurt and/or the robot being damaged.
For small perturbations, in-place recovery strategies controlling the Center of Pressure (CoP) or the centroidal angular momentum, or using foot-tilting, are sufficient.
To handle stronger perturbations, controllers based on Zero-Moment Point (ZMP) trajectory generation have been proposed, along with Model Predictive Control (MPC) methods controlling the ZMP, but in practice the efficiency was very limited.
Consequently, there is still a need for a purely reactive controller for standing push recovery on exoskeletons, to be used as the last resort fallback in case of emergency, and able to feature a variety of human-like balancing strategies while guaranteeing predictable, safe and smooth behavior.
For these purposes, the present disclosure provides, according to a first aspect, a method for training a neural network for stabilizing a bipedal robot presenting a plurality of degrees of freedom actuated by actuators, the method being characterised in that it comprises the implementation by the data processing means of a server of steps of:
Preferred but non-limiting features are as follows:
Said simulation lasts for a predetermined duration, steps (a) and (b) being repeated for a plurality of simulations.
Pushes are applied periodically over the duration of the simulation, with forces of constant magnitude applied for a predetermined duration.
Pushes are applied on a pelvis of the virtual twin of the robot with an orientation sampled from a spherical distribution.
At least one terminal condition on the virtual twin of the robot is enforced during step (b).
Said terminal condition is chosen among a range of positions and/or orientations of a pelvis of the exoskeleton, a minimal distance between feet of the robot, a range of positions and/or velocities of the actuators, a maximum difference with an expected trajectory, a maximum recovery duration, and a maximum power consumption.
Said simulation outputs a state of the virtual twin of the robot as a function of the pushes and the commands provided by the neural network.
The robot comprises at least one sensor for observing the state of the robot, wherein the neural network takes as input in step (b) the state of the virtual twin of the robot as outputted by the simulation.
The neural network provides as commands target positions and/or velocities of the actuators, and a control loop mechanism determines torques to be applied by the actuators as a function of said target positions and/or velocities of the actuators.
The neural network provides commands at a first frequency, and the control loop mechanism provides torques at a second frequency which is higher than the first frequency.
Step (b) comprises performing temporal and/or spatial regularization, so as to improve smoothness of the commands of the neural network.
The method comprises a step (c) of storing the trained neural network in a memory of the robot.
The bipedal robot is an exoskeleton accommodating a human operator.
According to a second aspect, the disclosure proposes a method for stabilizing a bipedal robot presenting a plurality of degrees of freedom actuated by actuators, characterized in that it comprises the steps of:
According to a third aspect, the disclosure proposes a system comprising a server and a bipedal robot presenting a plurality of degrees of freedom actuated by actuators, each comprising data processing means, characterized in that said data processing means are respectively configured to implement the method for training a neural network for stabilizing the robot according to the first aspect and the method for stabilizing the exoskeleton according to the second aspect.
According to a fourth and a fifth aspect, the disclosure proposes a computer program product comprising code instructions for executing the method for training a neural network for stabilizing the robot according to the first aspect or the method for stabilizing the robot according to the second aspect; and a computer-readable medium, on which is stored a computer program product comprising code instructions for executing the method for training a neural network for stabilizing the robot according to the first aspect or the method for stabilizing the robot according to the second aspect.
Other characteristics and advantages of the present disclosure will become clear on reading the description that follows of a preferential embodiment. This description will be given with reference to the appended drawings in which:
According to two complementary aspects of the disclosure, the following are proposed:
These two types of processes are implemented within an architecture as shown in
The server 10 is the training server (implementing the first method) and the robot 1 implements the second method. In other words, the robot 1 directly applies said neural network once trained, to stabilize the exoskeleton (in particular by performing “emergency recovery” as previously explained).
It is quite possible that the server 10 be embedded in the robot 1, but in practice the server 10 is a remote server.
The server 10 is typically computer equipment connected to the robot 1 via any kind of network 20, such as the Internet, for the exchange of data, even if in practice, once the neural network has been trained and deployed, i.e. provided to the robot 1, the communication can be interrupted, at least intermittently.
Each of the server 10 and the exoskeleton 1 comprises processor-type data processing means 11, 11′ (in particular the data processing means 11 of the server 10 have a high computing power, because the training is long and complex compared to the simple application of the learned neural network to new data, referred to as inference), and if appropriate data storage means 12, 12′ such as a computer memory, for example a hard disk.
It will be understood that there can be a plurality of robots 1 each connected to the server 10.
By bipedal robot 1 it is meant an articulated mechanical system, actuated and commanded, provided with two legs, which is more preferably an exoskeleton such as represented
In the following specification, the preferred example of an exoskeleton will be described, as this is the most difficult type of bipedal robot to stabilize, but the present method is effective for any other bipedal robot, such as humanoid robots.
By “stabilizing” the exoskeleton 1, it is meant preserving its balance as much as possible and in particular preventing it from falling even in case of strong perturbations thanks to emergency recovery. By “emergency recovery” it is meant a reflex movement of the exoskeleton 1 which is to be leveraged in case of emergency in response to a perturbation (i.e. used as the last resort fallback), such as tilting the pelvis or executing a step. The emergency recovery is successful when the exoskeleton 1 comes back to a safe posture, in particular a standstill posture if it was not walking prior to the perturbation. Note that the recovery should be as “static” as possible (i.e. involve a minimal movement), but in case of a strong perturbation it may sometimes require several steps to prevent falling.
The exoskeleton 1 comprises on each leg a foot structure comprising a support plane on which a foot of a leg of the person wearing the exoskeleton can be supported.
The exoskeleton 1 has a plurality of degrees of freedom, that is to say joints (generally rotational) that are deformable, i.e. moveable with respect to each other, and which are each either “actuated” or “non-actuated”.
An actuated degree of freedom designates a joint provided with an actuator controlled by data processing means 11′, that is to say that this degree of freedom is controlled and that it is possible to act upon. Conversely, a non-actuated degree of freedom designates a joint not provided with an actuator, that is to say that this degree of freedom follows its own dynamic and that the data processing means 11′ do not have direct control of it (but a priori an indirect control via the other actuated degrees of freedom). In the example of
In a preferred embodiment, the exoskeleton 1 comprises 6 actuators on each leg (i.e. 12 actuators), and a set of sensors referred to as “basic proprioceptive sensors”, such as means for detecting the impact of the feet on the ground 13 and/or at least one inertial measurement unit (IMU) 14. In
The data processing means 11′ designate computer equipment (typically a processor, either external if the exoskeleton 1 is “remotely controlled”, but preferentially embedded in the exoskeleton 1) suited to processing instructions and generating commands intended for the different actuators. As explained, said data processing means 11′ of the exoskeleton 1 will be configured to implement said neural network to control the actuated degrees of freedom (through the actuators). Said actuators may be electric, hydraulic, etc.
The operator can be equipped with a sensor vest 15 that detects the configuration of his torso (orientation of the torso). The direction in which the operator points his chest is the direction in which he wants to walk and the speed is given by the intensity with which he puts his chest forward (how far he bends).
The present application is not limited to any particular architecture of exoskeleton 1; the example described in the applications WO2015140352 and WO2015140353 will be considered.
Those skilled in the art will however know how to adapt the present method to any other mechanical architecture.
According to the first aspect, the disclosure proposes the method for training the neural network for stabilizing the exoskeleton 1 accommodating a human operator and presenting a plurality of actuated degrees of freedom, performed by the data processing means 11 of the server 10.
While the use of a neural network for stabilizing bipedal robots is known, the present disclosure proposes an original training of such a neural network.
More precisely, said neural network is trained by performing a reinforcement learning algorithm in a simulation specifically conceived to trigger emergency recovery. In other words, this neural network simply acts as a last resort solution for stabilizing the exoskeleton 1 in case of emergency. Said neural network does not have another role such as applying trajectories.
By training a neural network, it is meant defining suitable values of the parameters θ of the neural network, such that the neural network implements an efficient “policy” πθ for acting in response to a perturbation, see below.
As explained, the neural network is for example a feedforward neural network without any memory, possibly with 2 hidden layers of 64 units each, activation layers such as LeakyReLU, and a linear output layer. Note that alternative architectures, such as architectures with convolution layers, are possible. For better generalization, architectures with the minimal number of parameters necessary to reach satisfactory performance may be chosen. This avoids overfitting and leads to more predictable and robust behavior on the robot, even for unseen states.
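By way of illustration, a minimal sketch of such an architecture is given below in PyTorch; the state dimension is a hypothetical placeholder, while the 12 outputs correspond to the 12 actuators mentioned further on:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Memoryless feedforward policy: 2 hidden layers of 64 units each,
    LeakyReLU activations and a linear output layer."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.LeakyReLU(),
            nn.Linear(64, 64),
            nn.LeakyReLU(),
            nn.Linear(64, action_dim),  # linear output layer: unbounded commands
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Hypothetical dimensions: the observation size depends on the available
# sensors; the action size matches the 12 actuators of the exoskeleton.
policy = PolicyNetwork(state_dim=44, action_dim=12)
```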
As represented by the appended drawings, in a first step (a), the data processing means 11 of the server 10 perform a simulation of a virtual twin of the exoskeleton 1 undergoing a sequence of pushes in a simulation environment.
The open-source simulator Jiminy can for example be used as the simulation environment, see
By “push”, it is meant an external force applied to the virtual twin of the exoskeleton 1 to disturb its stability. To learn sophisticated recovery strategies, the external pushes in the learning environment need to be carefully designed. They must be strong enough to sometimes require stepping, but pushing too hard would prevent learning altogether. Moreover, increasing the maximum force gradually using curriculum learning is risky, as the policy tends to fall into local minima.
In a preferred embodiment, said sequence of pushes applies forces of constant magnitude, periodically and for a predetermined (short) duration, on the pelvis, with an orientation sampled from a spherical distribution. In other words, all pushes of the sequence have roughly the same magnitude of force and the same duration, but a different orientation.
For instance, the pushes can be applied every 3 s, with a jitter of 2 s so as not to overfit to a fixed push scheme and to learn to recover from consecutive pushes. The pushes are bell-shaped rather than uniform for numerical stability, have a peak magnitude of Fmax=800 N, and are applied for 400 ms.
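As a purely illustrative sketch of such a push schedule (the exact bell profile is an assumption; a Gaussian window is one plausible choice):

```python
import numpy as np

F_MAX = 800.0        # peak magnitude (N)
PUSH_DURATION = 0.4  # s
PUSH_PERIOD = 3.0    # s
PUSH_JITTER = 2.0    # s

def sample_push_direction(rng: np.random.Generator) -> np.ndarray:
    """Orientation sampled from a spherical distribution (uniform on the unit sphere)."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

def push_magnitude(t: float, t_start: float) -> float:
    """Bell-shaped force profile, smoother than a uniform step for numerical stability."""
    if not t_start <= t <= t_start + PUSH_DURATION:
        return 0.0
    mid = t_start + PUSH_DURATION / 2.0
    sigma = PUSH_DURATION / 6.0  # ~99.7% of the bell fits inside the push window
    return F_MAX * np.exp(-0.5 * ((t - mid) / sigma) ** 2)

def next_push_time(t_prev: float, rng: np.random.Generator) -> float:
    """Periodic schedule with jitter, to avoid overfitting to a fixed push scheme."""
    return t_prev + PUSH_PERIOD + rng.uniform(-PUSH_JITTER / 2, PUSH_JITTER / 2)
```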
Then, in a step (b), the data processing means 11 perform a reinforcement learning algorithm on said neural network, wherein the neural network provides commands to said actuators of the virtual twin of the exoskeleton 1 (i.e. controls them), so as to maximise a reward representative of a recovery of said virtual twin of the exoskeleton 1 from each push. In other words, the better the recovery, the higher the reward.
By “reinforcement learning” (RL), it is meant a machine learning paradigm wherein an agent (the neural network) ought to take actions (emergency recoveries) in an environment (the simulation) in order to maximize the reward. The goal of RL is that the agent learns a “policy”, i.e. the function defining the action to be taken as a function of its state. Reinforcement learning differs from supervised learning in not needing labelled input/output pairs, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted state-action space) and exploitation (of current knowledge).
In order to find the suitable actions to take, the RL can here use any suitable optimisation technique, in particular policy gradient methods, and preferably the Trust Region Policy Optimisation (TRPO), or even the Proximal Policy Optimization (PPO), which simplifies the optimization process with respect to TRPO while efficiently preventing “destructive policy updates” (overly large, deviating updates of the policy).
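For reference, PPO achieves this through its standard clipped surrogate objective, which bounds how far the updated policy may deviate from the previous one:

$$L^{PPO}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

where $\hat{A}_t$ is an estimate of the advantage of action $a_t$ in state $s_t$ and $\epsilon$ is the clipping parameter.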
Note that RL in simulation has already been used for training neural networks to control bipedal robots, and transfer to reality has been achieved with success (i.e. application of the trained neural network to the real robot), but only for the task of walking, i.e. so as to reproduce a corresponding reference motion from a gait library, see the document Z. Li, X. Cheng, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, “Reinforcement learning for robust parameterized locomotion control of bipedal robots,” CoRR, vol. abs/2103.14295, 2021. While the robot “Cassie” incidentally demonstrates some robustness to pushes in this document, this is mostly due to the design of Cassie (lack of upper body and almost weightless legs): the neural network is actually not trained for emergency recovery, and the same level of performance cannot be expected for exoskeletons.
As it will be explained, the neural network preferably controls said actuated degrees of freedom of the virtual twin of the exoskeleton 1 by providing as commands target torques τ to be applied by the actuators. Note that the neural network may directly output the torques, or indirectly output other parameter(s) such as target positions/velocities of the actuators, from which the torques can be calculated in a known fashion using a control loop mechanism (such as a proportional-derivative controller, see after).
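As an illustration, a proportional-derivative controller computing the torques from target positions and velocities could be sketched as follows (the gain values are hypothetical placeholders); such a control loop typically runs at a higher frequency than the neural network, holding the last targets between two network updates:

```python
import numpy as np

KP = np.full(12, 500.0)  # proportional gains, one per actuator (hypothetical values)
KD = np.full(12, 10.0)   # derivative gains (hypothetical values)

def pd_torques(q_target: np.ndarray, v_target: np.ndarray,
               q_measured: np.ndarray, v_measured: np.ndarray) -> np.ndarray:
    """Torques to be applied by the actuators as a function of the target
    positions/velocities provided by the neural network."""
    return KP * (q_target - q_measured) + KD * (v_target - v_measured)
```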
In a preferred embodiment, the neural network and the environment interact according to the
It is thus to be understood that steps (a) and (b) are actually simultaneous: the neural network attempts to perform an emergency recovery for each push of the sequence by suitably controlling the actuators.
Said simulation preferably lasts for a predetermined duration, such as 60 s. Steps (a) and (b) are typically repeated for a plurality of simulations, possibly thousands or even millions of simulations, referred to as “episodes”. Thus, the environment is reset at each new simulation: if the virtual twin falls during a simulation, it can be force-reset, or left as is until the beginning of the next simulation.
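Schematically, an episode can be organized as in the following gym-style pseudo-loop (a sketch only: the actual interface of the simulation environment may differ, and the 25 Hz policy rate is deduced from the T=1500 time steps over 60 s mentioned further on):

```python
EPISODE_DURATION = 60.0  # s, predetermined duration of one simulation
POLICY_PERIOD = 0.04     # s, i.e. 25 Hz, hence 1500 time steps per episode

def run_episode(env, policy):
    """One simulation (episode): reset the environment, then let the policy
    react to the scheduled pushes until the time limit or a terminal condition."""
    state = env.reset()  # the environment is reset at each new simulation
    total_reward = 0.0
    for _ in range(int(EPISODE_DURATION / POLICY_PERIOD)):
        action = policy(state)                     # commands for the actuators
        state, reward, done, _ = env.step(action)  # the simulator applies the pushes
        total_reward += reward
        if done:  # fall, or a terminal condition was violated
            break
    return total_reward
```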
In addition, at least one terminal condition on the virtual twin of the exoskeleton 1 is preferably enforced during step (b). The terminal conditions are to be seen as hard constraints, reflecting for example hardware limitations and safety. Besides, they enhance convergence robustness by preventing the policy from falling into bad local minima.
Each terminal condition is advantageously chosen among a range of positions and/or orientations of a pelvis of the exoskeleton 1, a minimal distance between feet of the exoskeleton 1, a range of positions and/or velocities of the actuators, a maximum difference with an expected trajectory, a maximum recovery duration, and a maximum power consumption. Preferably, a plurality of these terminal conditions (and even all of them) are enforced.
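As an illustration, such terminal conditions can be implemented as simple predicates evaluated at every simulation step; all thresholds and attribute names below are hypothetical placeholders:

```python
import numpy as np

def violates_terminal_conditions(sim) -> bool:
    """Hard constraints reflecting hardware limitations and safety.
    The episode is terminated as soon as any of them is violated."""
    return (
        abs(sim.pelvis_roll) > 0.5                # rad, pelvis orientation range
        or abs(sim.pelvis_pitch) > 0.5            # rad
        or sim.pelvis_height < 0.3                # m, pelvis position range
        or sim.feet_distance < 0.1                # m, minimal distance between feet
        or np.any(np.abs(sim.motor_velocities) > sim.velocity_limits)
        or sim.recovery_duration > 10.0           # s, maximum recovery duration
        or sim.power_consumption > 3000.0         # W, maximum power consumption
    )
```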
As mentioned, the RL maximises a reward representative of a recovery of said virtual twin of the exoskeleton 1 from each push. The reward can be seen as a score calculated at each simulation by the training algorithm, which is sought to be as high as possible. The idea is to have a high reward when the virtual twin of the exoskeleton performs efficient recoveries, i.e. stays balanced, and a low reward otherwise, for example if it falls. In other words, the reward increases with the efficiency of the stabilization.
The reward preferably comprises a set of reward components, so as to obtain a natural behavior that is comfortable for the user and to provide insight into how to keep balance.
The reward can also be used as a means to trigger recovery as late as possible, since it is intended as the last-resort emergency strategy.
Note that, in order to be generic, the reward can also be used for reference gait (i.e. following reference trajectory) training.
In the case of a plurality of reward components, the total reward can be a weighted sum of the individual objectives:

$r_t = \sum_i \omega_i K(r_i)$

where $r_i$ is a reward component, $\omega_i$ its weight, and $K$ a kernel function meant to scale them equally, such as the Radial Basis Function (RBF) with cutoff parameter $\kappa$, i.e. $K(r_i) = \exp(-\kappa r_i^2) \in [0, 1]$.
As a side-effect of this scaling, the gradient vanishes for both very small and very large values. The cutoff parameter $\kappa$ is used to adjust the operating range of every reward component.
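In code, this weighted aggregation could be sketched as follows (the component names, weights and cutoffs are illustrative):

```python
import numpy as np

def rbf_kernel(r: float, kappa: float) -> float:
    """Radial Basis Function kernel: maps an error r to (0, 1], with maximum 1 at r = 0."""
    return float(np.exp(-kappa * r ** 2))

def total_reward(components: dict, weights: dict, kappas: dict) -> float:
    """r_t = sum_i w_i * K(r_i); each r_i is an error term to be driven towards zero."""
    return sum(weights[name] * rbf_kernel(value, kappas[name])
               for name, value in components.items())

# Illustrative usage with hypothetical components:
r_t = total_reward(
    components={"pelvis_tilt": 0.05, "foot_slippage": 0.01},
    weights={"pelvis_tilt": 1.0, "foot_slippage": 0.5},
    kappas={"pelvis_tilt": 10.0, "foot_slippage": 100.0},
)
```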
There are lots of possible rewards (or reward components) representative of a recovery of said virtual twin of the exoskeleton 1 from each push, including naïve ones such as simply counting the number of seconds before the exoskeleton falls. The skilled person will not be limited to any specific choice of reward(s), as long as it assesses the ability of the virtual twin of the exoskeleton 1 to recover from each push.
We note that, in the example of the RBF, $K$ reaches its maximum when $r_i = 0$, i.e. it is decreasing on $\mathbb{R}^+$. This means that the reward components shall in this case be chosen as decreasing with the ability to recover from each push (i.e. the better the recovery, the lower the reward component value, since it formulates an error), which will be the case in the following examples; however, the skilled person can perfectly use differently constructed reward components which are increasing functions, together with a suitable increasing $K$ function.
In the preferred example, there may be three classes of rewards:
At the end of step (b), a neural network able to command said actuators of the exoskeleton 1 to recover from real-life pushes is obtained.
In a following step (c), the trained neural network is stored in a memory 12′ of the exoskeleton 1, typically from a memory 12 of the server 10. This is called “transfer to reality”: up to now the neural network has only recovered from pushes in simulation, and once stored in the memory 12′ the neural network is expected to provide commands to the real actuators of the real exoskeleton 1, so as to perform real recovery of the exoskeleton 1 from unexpected pushes.
Thus, in a second aspect, is proposed the method for stabilizing the exoskeleton 1 accommodating a human operator and presenting a plurality of degrees of freedom actuated by actuators, comprising a step (d) of providing commands to said actuators of the exoskeleton 1 with the neural network trained using the method according to the first aspect.
As explained the neural network can be generated long before, and be transferred to a large number of exoskeletons 1.
Said second method (stabilizing method) can be performed numerous times during the life of the exoskeleton 1, and does not require new transfers, even if the neural network could be updated should an improved policy become available.
In many prior-art attempts, the reality gap has mostly been overlooked and the simulated results hardly transfer to real hardware. Either the transfer is unsuccessful in practice because the physics is over-simplified and hardware limitations are ignored, or regularity is not guaranteed and unexpected hazardous motions can occur.
The present training method enables safe and predictable behavior, which is critical for autonomous systems evolving in a human environment such as bipedal exoskeletons.
In the special case of a medical exoskeleton, comfort and smoothness are even more critical. Vibrations can cause anxiety and, more importantly, lead to injuries over time since patients have fragile bones.
Therefore, step (b) preferably comprises performing temporal and/or spatial regularization of the policy, so as to improve smoothness of the commands of the neural network. It is also called “conditioning” of the policy.
Usually in RL, smoothness can be promoted by adding regularizers as reward components, such as the minimization of motor torques, motor velocities or power consumption. However, components in the reward function are not guaranteed to be optimized, because they have a minor contribution in the actual loss function of RL learning algorithms.
By contrast, injecting the regularization as extra terms in the loss function directly gives control about how much it is enforced during the learning, see for instance the document S. Mysore, B. El Mabsout, R. Mancuso, and K. Saenko, “Regularizing action policies for smooth control with reinforcement learning,” 12 2020.
In the preferred embodiment, temporal and spatial regularization is used to promote smoothness of the learned state-to-action mappings of the neural network, with for instance the following terms:
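Following the cited document, these terms can for instance take the form below, where the temporal term penalizes variations of the action between consecutive states, and the spatial term penalizes the sensitivity of the action to small perturbations of the state:

$$L_T(\theta) = \left\lVert \pi_\theta(s_t) - \pi_\theta(s_{t+1}) \right\rVert, \qquad L_S(\theta) = \left\lVert \pi_\theta(s_t) - \pi_\theta(\tilde{s}_t) \right\rVert \ \text{with}\ \tilde{s}_t \sim \mathcal{N}(s_t, \sigma_S)$$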
These terms can be added to the objective function, possibly with weights, i.e. in the case of PPO (see above): $L(\theta) = L^{PPO}(\theta) + \lambda_T L_T(\theta) + \lambda_S L_S(\theta)$.
Classically, σS is based on the expected measurement noise and/or tolerance, which restricts its use to robustness concerns. However, its true power is unveiled when smoothness is further used to shape and enforce regularity in the behavior of the policy.
By choosing the proper standard deviation, in addition to robustness, a minimal but efficient set of recovery strategies can be learnt, and the responsiveness and reactivity of the policy on the real device can be adjusted. To that end, σS is typically comprised between 0.1 and 0.7.
A further possible improvement is the introduction of the L1-norm in the temporal regularization. It still ensures that the policy reacts only if necessary and recovers as fast as possible. Yet, it also avoids penalizing too strictly the peaks that are beneficial to withstand some pushes, which a stricter norm would smooth out along with fast, very dynamic motions.
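A sketch of both regularization terms in PyTorch, with the L1 norm for the temporal term as suggested above (the value of σS is only an example within the range given above):

```python
import torch

def smoothness_losses(policy, s_t, s_next, sigma_s=0.3):
    """Temporal (L1) and spatial (L2) smoothness regularization terms,
    to be added to the RL loss with weights lambda_T and lambda_S."""
    a_t = policy(s_t)
    # Temporal term: penalize action variations between consecutive states;
    # the L1 norm tolerates occasional peaks that help withstand strong pushes.
    l_t = torch.mean(torch.abs(a_t - policy(s_next)))
    # Spatial term: penalize sensitivity to small state perturbations.
    s_noisy = s_t + sigma_s * torch.randn_like(s_t)
    l_s = torch.mean((a_t - policy(s_noisy)) ** 2)
    return l_t, l_s
```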
Finally, it is more appropriate to use the mean field
With an episode duration limited to 60 s, this corresponds to T=1500 time steps, i.e. a 25 Hz policy rate. In practice, 100M iterations are necessary for asymptotic optimality under worst-case conditions, corresponding to roughly one and a half months of experience on a real exoskeleton 1 (10^8 steps at 25 Hz amount to about 46 days). Training takes 6 h to obtain a satisfying and transferable policy, using 40 independent simulation workers on a single machine with 64 physical cores and 1 Tesla V100 GPU.
The training curves of the average episode reward and duration show the impact of the main contributions:
Further tests show that smoothness conditioning improves the learned behavior, cancels harmful vibrations and preserves the very dynamic motions. Moreover, it also recovers balance more efficiently, by taking shorter, minimal actions.
Finally, the trained neural network has been evaluated with both a user and a dummy on several real Atalante units. Contrary to the learning scenario with only pushes at the pelvis center, the policy can handle many types of external disturbances: the exoskeleton has been pushed in reality at several different application points and impressive results are obtained. The recovery strategies are reliable for all push variations and even pulling. The transfer to Atalante works out-of-the-box despite wear of the hardware and discrepancies with simulation, notably ground friction, mechanical flexibility and patient disturbances.
According to another aspect, the disclosure relates to the system of the server 10 and the bipedal robot 1 (preferably an exoskeleton accommodating a human operator), for the implementation of the method according to the first aspect (training the neural network) and/or the second aspect (stabilizing the robot).
As explained, each of the server 10 and the robot 1 comprises data processing means 11, 11′ and data storage means 12, 12′ (optionally external). The robot 1 generally also comprises sensors such as inertial measurement means 14 (inertial unit) and/or means for detecting the impact of the feet on the ground 13 (contact sensors or optionally pressure sensors).
The robot 1 has a plurality of degrees of freedom, each actuated by an actuator controlled by the data processing means 11′.
The data processing means 11 of the server 10 are configured to implement the method for training a neural network for stabilizing the robot 1 according to the first aspect.
The data processing means 11′ of the robot 1 are configured to implement the method for stabilizing the robot 1 using the neural network according to the second aspect.
According to fourth and fifth aspects, the disclosure relates to a computer program product comprising code instructions for executing (on the processing means 11, 11′) the method for training a neural network for stabilizing the robot 1 according to the first aspect or the method for stabilizing the robot 1 according to the second aspect, as well as storage means readable by computer equipment (for example the data storage means 12, 12′) on which this computer program product is found.
This application is the 35 U.S.C. § 371 national stage application of PCT Application No. PCT/EP2023/054307, filed Feb. 21, 2023, which application claims the benefit of European Application No. EP 22305215.0 filed Feb. 25, 2022, both of which are hereby incorporated by reference herein in their entireties.