Methods for Training a Neural Network and for Using Said Neural Network to Stabilize a Bipedal Robot

Abstract
A method for training a neural network for stabilizing a bipedal robot (1) presenting a plurality of degrees of freedom actuated by actuators is proposed. The method comprises the implementation by the data processing means (11) of a server (10) of steps of: (a) In a simulation, applying a sequence of pushes on a virtual twin of the robot (1). (b) Performing a reinforcement learning algorithm on said neural network, wherein the neural network provides commands to said actuators of the virtual twin of the robot (1), so as to maximise a reward representative of a recovery of said virtual twin of the robot (1) from each push.
Description
GENERAL TECHNICAL FIELD

The present disclosure relates to the field of exoskeleton type robots.


More specifically, it relates to methods for training a neural network and for using said neural network to stabilize a bipedal robot such as an exoskeleton.


BACKGROUND

Recently, for persons with significant mobility problems such as paraplegics, assisted walking devices called exoskeletons have appeared, which are external robotised devices that the operator (the human user) “slips on” thanks to a system of fasteners which links the movements of the exoskeleton with his own movements. The exoskeletons of lower limbs have several joints, generally at least at the level of the knees and the hips, to reproduce the walking movement. Actuators make it possible to move these joints, which in their turn make the operator move. An interface system allows the operator to give orders to the exoskeleton, and a control system transforms these orders into commands for the actuators. Sensors generally complete the device.


These exoskeletons constitute an advance compared to wheelchairs, because they allow operators to get back on their feet and to walk. Exoskeletons are no longer limited by wheels and can theoretically move in the majority of non-flat environments: wheels, unlike legs, do not make it possible to clear significant obstacles such as steps, stairs, obstacles that are too high, etc.


However, achieving dynamic stability for such exoskeletons is still a challenge. Continuous feedback control is required to keep balance since the vertical posture is inherently unstable. Trajectory planning for bipedal robots has been solved successfully through whole-body optimization, and stable walking on reasonably flat ground and without major disturbances was achieved on the exoskeleton Atalante of the present applicant by using advanced traditional planning approaches, see for instance the patent applications EP3568266 and FR3101463, or the document T. Gurriet, S. Finet, G. Boeris, A. Duburcq, A. Hereid, O. Harib, M. Masselin, J. Grizzle, and A. D. Ames, “Towards restoring locomotion for paraplegics: Realizing dynamically stable walking on exoskeletons,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2804-2811.


Yet, so-called “emergency recovery” is still an open problem. In more detail, while the exoskeleton is perfectly able to keep a smooth and stable trajectory while walking, it may be thrown off balance if it undergoes a strong and unexpected disturbance like a push, which could result in the human operator being hurt and/or the robot being damaged.


For small perturbations, in-place recovery strategies controlling the Center of Pressure (CoP), the centroidal angular momentum, or using foot-tilting are sufficient.


To handle stronger perturbations, controllers based on Zero-Moment Point (ZMP) trajectory generation have been proposed, along with Model Predictive Control (MPC) methods controlling the ZMP, but in practice the efficiency was very limited.


Consequently, there is still a need for a purely reactive controller for standing push recovery on exoskeletons, to be used as the last resort fallback in case of emergency, and able to feature a variety of human-like balancing strategies while guaranteeing predictable, safe and smooth behavior.


SUMMARY

For these purposes, the present disclosure provides, according to a first aspect, a method for training a neural network for stabilizing a bipedal robot presenting a plurality of degrees of freedom actuated by actuators, the method being characterised in that it comprises the implementation by the data processing means of a server of steps of:

    • (a) In a simulation, applying a sequence of pushes on a virtual twin of the robot.
    • (b) Performing a reinforcement learning algorithm on said neural network, wherein the neural network provides commands to said actuators of the virtual twin of the robot, so as to maximise a reward representative of a recovery of said virtual twin of the robot from each push.


Preferred but non-limiting features are as follows:


Said simulation lasts for a predetermined duration, steps (a) and (b) being repeated for a plurality of simulations.


Pushes are applied periodically over the duration of the simulation, with forces of constant magnitude applied for a predetermined duration.


Pushes are applied on a pelvis of the virtual twin of the robot with an orientation sampled from a spherical distribution.


At least one terminal condition on the virtual twin of the robot is enforced during step (b).


Said terminal condition is chosen among a range of positions and/or orientations of a pelvis of the exoskeleton, a minimal distance between feet of the robot, a range of positions and/or velocities of the actuators, a maximum difference with an expected trajectory, a maximum recovery duration, and a maximum power consumption.


Said simulation outputs a state of the virtual twin of the robot as a function of the pushes and the commands provided by the neural network.


The robot comprises at least one sensor for observing the state of the robot, wherein the neural network takes as input in step (b) the state of the virtual twin of the robot as outputted by the simulation.


The neural network provides as commands target positions and/or velocities of the actuators, and a control loop mechanism determines torques to be applied by the actuators as a function of said target positions and/or velocities of the actuators.


The neural network provides commands at a first frequency, and the control loop mechanism provides torques at a second frequency which is higher than the first frequency.


Step (b) comprises performing temporal and/or spatial regularization, so as to improve smoothness of the commands of the neural network.


The method comprises a step (c) of storing the trained neural network in a memory of the robot.


The bipedal robot is an exoskeleton accommodating a human operator.


According to a second aspect, the disclosure proposes a method for stabilizing a bipedal robot presenting a plurality of degrees of freedom actuated by actuators, characterized in that it comprises the step of:

    • (d) providing commands to said actuators of the robot with the neural network trained using the method according to the first aspect.


According to a third aspect, the disclosure proposes a system comprising a server and a bipedal robot presenting a plurality of degrees of freedom actuated by actuators, each comprising data processing means, characterized in that said data processing means are respectively configured to implement the method for training a neural network for stabilizing the robot according to the first aspect and the method for stabilizing the exoskeleton according to the second aspect.


According to a fourth and a fifth aspect, the disclosure proposes a computer program product comprising code instructions for executing the method for training a neural network for stabilizing the robot according to the first aspect or the method for stabilizing the robot according to the second aspect; and a computer-readable medium, on which is stored a computer program product comprising code instructions for executing the method for training a neural network for stabilizing the robot according to the first aspect or the method for stabilizing the robot according to the second aspect.





DESCRIPTION OF THE FIGURES

Other characteristics and advantages of the present disclosure will become clear on reading the description that follows of a preferential embodiment. This description will be given with reference to the appended drawings in which:



FIG. 1 illustrates an architecture for the implementation of the methods according to the disclosure;



FIG. 2 represents an exoskeleton used in the methods according to the disclosure;



FIG. 3 is a diagram illustrating a preferred embodiment of the methods according to the disclosure;



FIG. 4 represents a virtual twin of the exoskeleton in a simulation;



FIG. 5 is an overview of a proposed control system in the methods according to the disclosure.





DETAILED DESCRIPTION
Architecture

According to two complementary aspects of the disclosure, are proposed:

    • a method for training a neural network, in particular of the “forward propagation” type (FNN, “Feedforward Neural Network”) for stabilizing a bipedal robot 1; and
    • a method for stabilizing a bipedal robot 1 (using a neural network, advantageously trained thanks to the aforementioned method).


These two types of processes are implemented within an architecture as shown in FIG. 1, thanks to a server 10.


The server 10 is the training server (implementing the first method) and the robot 1 implements the second method. In other words, the robot 1 directly applies said neural network once trained, to stabilize the exoskeleton (in particular by performing “emergency recovery” as previously explained).


It is quite possible for the server 10 to be embedded in the robot 1, but in practice the server 10 is a remote server.


The server 10 is typically a computer equipment connected to the robot 1 via any kind of network 20 such as the Internet network for the exchange of data, even if in practice, once the neural network has been trained and deployed, i.e. provided to the robot 1, the communication can be interrupted, at least intermittently.


Each of the server 10 and the exoskeleton 1 comprises processor-type data processing means 11, 11′ (in particular the data processing means 11 of the server 10 have a high computing power, because the training is long and complex compared to the simple application of the learned neural network to new data, referred to as inference), and if appropriate data storage means 12, 12′ such as a computer memory, for example a hard disk.


It will be understood that there can be a plurality of robots 1 each connected to the server 10.


By bipedal robot 1 it is meant an articulated mechanical system, actuated and commanded, provided with two legs, which is more preferably an exoskeleton such as represented in FIG. 2, which is a more specific bipedal robot accommodating a human operator having his lower limbs each integral with a leg of the exoskeleton 1. This can be ensured notably thanks to straps.


In the following specification, the preferred example of an exoskeleton will be described, as this is the most difficult type of bipedal robot to stabilize, but the present method is effective for any other bipedal robot, such as humanoid robots.


By “stabilizing” the exoskeleton 1, it is meant preserving its balance as much as possible and in particular preventing it from falling even in case of strong perturbations thanks to emergency recovery. By “emergency recovery” it is meant a reflex movement of the exoskeleton 1 which is to be leveraged in case of emergency in response to a perturbation (i.e. used as the last resort fallback), such as tilting the pelvis or executing a step. The emergency recovery is successful when the exoskeleton 1 comes back to a safe posture, in particular a standstill posture if it was not walking prior to the perturbation. Note that the recovery should be as “static” as possible (i.e. involve a minimal movement), but in case of a strong perturbation it may sometimes require several steps to prevent falling.


The exoskeleton 1 comprises on each leg a foot structure comprising a support plane on which a foot of a leg of the person wearing the exoskeleton can be supported.


The exoskeleton 1 has a plurality of degrees of freedom, that is to say joints that are deformable (generally via a rotation), i.e. moveable with respect to each other, each of which is either “actuated” or “non-actuated”.


An actuated degree of freedom designates a joint provided with an actuator controlled by data processing means 11′, that is to say that this degree of freedom is controlled and that it is possible to act upon it. Conversely, a non-actuated degree of freedom designates a joint not provided with an actuator, that is to say that this degree of freedom follows its own dynamic and that the data processing means 11′ do not have direct control of it (but a priori an indirect control via the other actuated degrees of freedom). In the example of FIG. 2, the heel-ground contact is punctual, and the exoskeleton 1 is thereby free in rotation with respect to this contact point. The angle between the heel-hip axis and the vertical then constitutes a non-actuated degree of freedom.


In a preferred embodiment, the exoskeleton 1 comprises 6 actuators on each leg (i.e. 12 actuators), and a set of sensors referred to as “basic proprioceptive sensors”, such as means for detecting the impact of the feet on the ground 13 and/or at least one inertial measurement unit (IMU) 14. In FIG. 2, there is a “pelvis IMU” 14 at the back of the exoskeleton, but there could be further IMUs such as left/right tibia IMUs and/or left/right foot IMUs, etc. Note that there could be further sensors, and in particular an encoder for each actuator (measuring the actuator position).


The data processing means 11′ designate a computer equipment (typically a processor, either external if the exoskeleton 1 is “remotely controlled”, but preferentially embedded in the exoskeleton 1), suited to processing instructions and generating commands intended for the different actuators. As explained, said data processing means 11′ of the exoskeleton 1 will be configured to implement said neural network to control the actuated degrees of freedom (through the actuators). Said actuators may be electric, hydraulic, etc.


The operator can be equipped with a sensor vest 15 that detects the configuration of his torso (orientation of the torso). The direction in which the operator points his chest is the direction in which he wants to walk and the speed is given by the intensity with which he puts his chest forward (how far he bends).


The present application is not limited to any particular exoskeleton architecture 1; the example described in the applications WO2015140352 and WO2015140353 will be considered.


Those skilled in the art will however know how to adapt the present method to any other mechanical architecture.


Training Method

According to the first aspect, the disclosure proposes the method for training the neural network for stabilizing the exoskeleton 1 accommodating a human operator and presenting a plurality of actuated degrees of freedom, performed by the data processing means 11 of the server 10.


While the use of a neural network for stabilizing bipedal robots is known, the present disclosure proposes an original training of such neural network.


More precisely, said neural network is trained by performing a reinforcement learning algorithm in a simulation specifically conceived to trigger emergency recovery. In other words, this neural network simply acts as a last resort solution for stabilizing the exoskeleton 1 in case of emergency. Said neural network does not have any other role, such as applying trajectories.


By training a neural network, it is meant defining suitable values of parameters θ of the neural network, such that the neural network implements an efficient “policy” πθ for acting in response to a perturbation, see below.


As explained, the neural network is for example a feedforward neural network without any memory, possibly with 2 hidden layers of 64 units each, activation layers such as LeakyReLU, and a linear output layer. Note that alternative architectures, such as architectures with convolution layers, are possible. For better generalization, an architecture with the minimal number of parameters necessary to achieve satisfactory performance may be chosen. This avoids overfitting, which leads to more predictable and robust behavior on the robot, even for unseen states.
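
By way of illustration only, a possible sketch of such a policy network is given below, in Python using the PyTorch library; the observation dimension (49) and the action dimension (12 actuators) are taken from the preferred embodiment described herein, and all names are merely illustrative, not an actual implementation of the disclosure.

# Illustrative sketch only: a small feedforward policy network matching the
# architecture described above (two hidden layers of 64 units, LeakyReLU
# activations, linear output layer). Dimensions and names are examples.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, obs_dim: int = 49, act_dim: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.LeakyReLU(),
            nn.Linear(64, 64),
            nn.LeakyReLU(),
            nn.Linear(64, act_dim),  # linear output: one command per actuator
        )

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        # Maps an observation vector to the commands (e.g. target velocities).
        return self.net(observation)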


As represented by FIG. 3, in a first step (a), the method comprises applying, in said simulation, a sequence of pushes on a virtual twin of the exoskeleton 1. By “virtual twin”, it is meant a virtual replica of the exoskeleton 1 in the simulation, with the same actuators, sensors, properties, etc. The human operator is advantageously considered as rigidly fastened to the exoskeleton 1. In this regard, the system exoskeleton-human can be viewed as a humanoid robot after aggregating their respective mass distributions.


The open-source simulator Jiminy can for example be used as the simulation environment, see FIG. 4. It has been created for robotic control problems and designed for realistic simulation of legged robots. Jiminy includes motor inertia and friction, sensor noise and delay, accurate constraint-based ground reaction forces, as well as an advanced mechanical flexibility model.


By “push”, it is meant an external force applied to the virtual twin of the exoskeleton 1 for disturbing its stability. To learn sophisticated recovery strategies, the external pushes in the learning environment need to be carefully designed. They must be strong enough to sometimes require stepping, but pushing too hard would prohibit learning. Moreover, increasing the maximum force gradually using curriculum learning is risky, as the policy tends to fall into local minima.


In a preferred embodiment, said sequence of pushes applies forces of constant magnitude for a predetermined (short) duration, periodically, on the pelvis, the orientation being sampled from a spherical distribution. In other words, all pushes of the sequence have roughly the same force magnitude and the same duration, but a different orientation.


For instance, the pushes can be applied every 3 s, with a jitter of 2 s, so as not to overfit to a fixed push scheme and to learn to recover from consecutive pushes. The pushes are bell-shaped instead of uniform for numerical stability, have a peak magnitude of Fmax=800 N and are applied during 400 ms.
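
By way of illustration, a possible sketch (in Python) of such a push schedule is given below; the squared-sine profile used to model the bell shape is an assumption, the text only specifying that the profile is bell-shaped with the given peak magnitude and duration.

# Illustrative sketch of the push sequence: pushes roughly every 3 s with a 2 s
# jitter, direction sampled from a spherical distribution, bell-shaped profile
# with a peak of 800 N applied during 400 ms. The sin^2 profile is an assumption.
import numpy as np

F_MAX = 800.0        # peak force magnitude [N]
PUSH_DURATION = 0.4  # [s]
PUSH_PERIOD = 3.0    # nominal time between pushes [s]
PUSH_JITTER = 2.0    # random jitter on the period [s]

def sample_push_direction(rng: np.random.Generator) -> np.ndarray:
    """Unit vector sampled from a spherical (isotropic) distribution."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)

def next_push_time(t_previous: float, rng: np.random.Generator) -> float:
    """Time of the next push, with jitter around the nominal period."""
    return t_previous + PUSH_PERIOD + rng.uniform(-PUSH_JITTER, PUSH_JITTER)

def push_force(t: float, t_start: float, direction: np.ndarray) -> np.ndarray:
    """Bell-shaped external force applied on the pelvis, zero outside the window."""
    if not (t_start <= t <= t_start + PUSH_DURATION):
        return np.zeros(3)
    phase = (t - t_start) / PUSH_DURATION           # normalized time in [0, 1]
    magnitude = F_MAX * np.sin(np.pi * phase) ** 2  # smooth bell shape (assumption)
    return magnitude * direction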


Then, in a step (b), the data processing means 11 perform a reinforcement learning algorithm on said neural network, wherein the neural network provides commands to said actuators of the virtual twin of the exoskeleton 1 (i.e. controls them), so as to maximise a reward representative of a recovery of said virtual twin of the exoskeleton 1 from each push. In other words, the better the recovery, the higher the reward.


By “reinforcement learning” (RL), it is meant a machine learning paradigm wherein an agent (the neural network) ought to take actions (emergency recoveries) in an environment (the simulation) in order to maximize the reward. The goal of RL is that the agent learns a “policy”, i.e. the function defining the action to be taken as a function of its state. Reinforcement learning differs from supervised learning in not needing labelled input/output pairs, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted state-action space) and exploitation (of current knowledge).


In order to find the suitable actions, the RL can here use any suitable optimisation technique, in particular policy gradient methods, and preferably the Trust Region Policy Optimisation (TRPO), or even the Proximal Policy Optimization (PPO), which simplifies the optimization process with respect to TRPO while efficiently preventing “destructive policy updates” (overly large, deviating updates of the policy).
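
For reference only, the clipped surrogate objective of PPO, as commonly formulated in the literature, can be written LCLIP(θ)=Et[min(rt(θ)Ât, clip(rt(θ), 1−ε, 1+ε)Ât)], where rt(θ)=πθ(at|st)/πθold(at|st) is the probability ratio between the updated and previous policies, Ât is an advantage estimate and ε is the clipping parameter; the clipping is precisely what prevents the overly large policy updates mentioned above.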


Note that RL in simulation has already been used for training neural networks to control bipedal robots, with a successful transfer to reality (i.e. application of the trained neural network to the real robot), but only for the task of walking, i.e. so as to reproduce a corresponding reference motion from a gait library, see the document Z. Li, X. Cheng, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, “Reinforcement learning for robust parameterized locomotion control of bipedal robots,” CoRR, vol. abs/2103.14295, 2021. While the robot “Cassie” incidentally demonstrates stability to some pushes in this document, this is mostly due to the design of Cassie (lack of upper body and almost weightless legs); the neural network is actually not trained for emergency recovery, and the same level of performance cannot be expected for exoskeletons.


As it will be explained, the neural network preferably controls said actuated degrees of freedom of the virtual twin of the exoskeleton 1 by providing, as commands, target torques τ to be applied by the actuators. Note that the neural network may directly output the torques, or indirectly output other parameters such as target positions/velocities of the actuators, from which the torques can be calculated in a known fashion using a control loop mechanism (such as a proportional-derivative controller, see below).


In a preferred embodiment, the neural network and the environment interact according to the FIG. 5:

    • The state of the system st at time t is defined by the position qx,y,z:=[qx, qy, qz]T, orientation qψ,v,φ, linear velocity {dot over (q)}x,y,z and angular velocity {dot over (q)}ψ,v,φ of the pelvis of the exoskeleton 1, in addition to the positions qm and velocities {dot over (q)}m of the actuated degrees of freedom (i.e. motor positions and velocities). It is to be understood that the state of the exoskeleton also represents the non-actuated degrees of freedom, on which the neural network does not have direct control. The state of the human operator is discarded since it is not observable and the coupling with the exoskeleton is not modelled. As a result, the human is considered as an external disturbance. Even so, the state of the system st is not fully observable, e.g. due to flexibilities.
    • The observation ot∈O⊂ℝn is computed from the measurements of the sensors 13, 14. This observation space can reach 49 dimensions (less than the state space, which can reach 68, as not all the state is observable). Note that the target motor positions {tilde over (q)}mt−1 from the last time step can be included in the observation. No window over several time steps is accumulated. The quantities are individually normalized over training batches to avoid manual estimation of the sensitivity of every component of the observation space, which would be hazardous since it heavily depends on the policy itself. Some insightful quantities cannot be reliably estimated without exteroceptive sensors, e.g. the pelvis height qz and its linear velocity {dot over (q)}x,y,z. They are not included in the observation space because any significant mismatch between simulated and real data may prevent transfer to reality.
    • s0 is the state wherein the agent starts an episode (i.e. a simulation), which may vary. The initial state distribution ρ0(s0) over S defines the diversity of states s0 in which the agent can start. Choosing this distribution appropriately is essential to explore a large variety of states and to discover robust recovery strategies. Indeed, to be efficient, it is necessary to sample as many recoverable states as possible. Naive random sampling would lead to many unfeasible configurations that trigger terminal conditions (see after) instantly or are unrecoverable. Therefore, the best solution is to sample s0 normally distributed around a set of diverse reference trajectories, which enables exploring many different but feasible configurations.
    • the neural network outputs, as a function of the observation ot, target velocities of the actuated degrees of freedom, at a preferred first frequency of 25 Hz; the target positions {tilde over (q)}m of the actuators may be obtained by integration of the outputted target velocities so as to enforce a smooth behavior.
    • As explained, the neural network is preferably configured to output some “high-level” targets (the target positions and/or velocities of the actuators), which are forwarded to a decentralized low-level control loop mechanism (in particular a PD controller) determining the torques τ at a higher second frequency (such as 100 Hz); see the sketch after this list. Such a hybrid control architecture improves efficiency, robustness, and transferability on real hardware compared to predicting torques directly. Decoupled control of each joint is well known to be robust to model uncertainties. Moreover, these controllers can be tuned to trade off tracking accuracy versus compliance, smoothing out vibrations and errors in the predicted actions.
    • The target torques are applied to the virtual twin of the exoskeleton 1 in the simulation to update the state. The virtual twin “moves” and consequently the simulator outputs the updated positions and velocities of the actuators to the PD controller, and most importantly the updated observation ot.
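
As mentioned in the list above, the interplay between the low-frequency policy and the higher-frequency PD controller can be sketched as follows (in Python); the gains, the simulator accessors and the 100 Hz inner frequency used here are illustrative assumptions, not the actual controller of the disclosure.

# Illustrative sketch of the hybrid control loop: the policy outputs target joint
# velocities at 25 Hz, target positions are obtained by integration, and a
# decentralized PD controller computes the torques at a higher frequency.
import numpy as np

POLICY_DT = 1.0 / 25.0    # first frequency: 25 Hz
CONTROL_DT = 1.0 / 100.0  # second, higher frequency (e.g. 100 Hz)
KP, KD = 100.0, 5.0       # illustrative PD gains, identical for every joint

def pd_torque(q_target, v_target, q_measured, v_measured):
    """Decentralized PD control: one independent controller per actuated joint."""
    return KP * (q_target - q_measured) + KD * (v_target - v_measured)

def control_step(policy, observation, q_target, simulator):
    """One policy step followed by several inner PD/simulation steps."""
    v_target = policy(observation)                    # target joint velocities
    for _ in range(int(round(POLICY_DT / CONTROL_DT))):
        q_target = q_target + v_target * CONTROL_DT   # integrate to target positions
        q_m, v_m = simulator.get_motor_state()        # hypothetical accessor
        tau = pd_torque(q_target, v_target, q_m, v_m)
        simulator.apply_torques(tau)                  # hypothetical accessor
        simulator.step(CONTROL_DT)                    # hypothetical accessor
    observation = simulator.get_observation()         # hypothetical accessor
    return observation, q_target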


It is thus to be understood that steps (a) and (b) are actually simultaneous: the neural network attempts to perform an emergency recovery for each push of the sequence by suitably controlling the actuators.


Said simulation preferably lasts for a predetermined duration, such as 60 s. Steps (a) and (b) are typically repeated for a plurality of simulations, possibly thousands or even millions of simulations, referred to as “episodes”. Thus, the environment is reset at each new simulation: if the virtual twin falls during a simulation, it can be force-reset, or one can simply wait until the beginning of the next simulation.


In addition, at least one terminal condition on the virtual twin of the exoskeleton 1 is preferably enforced during step (b). The terminal conditions are to be seen as hard constraints such as hardware limitations and safety. Besides, they enhance convergence robustness by preventing the policy from falling into bad local minima.


Each terminal condition is advantageously chosen among a range of positions and/or orientations of a pelvis of the exoskeleton 1, a minimal distance between feet of the exoskeleton 1, a range of positions and/or velocities of the actuators, a maximum difference with an expected trajectory, a maximum recovery duration, and a maximum power consumption. Preferably, a plurality of these terminal conditions (and even all of them) are enforced.

    • 1) range of positions and/or orientations of a pelvis. To avoid frightening the user and people around, the pelvis motion is restricted. For instance the following ranges can be applied:
      • qz>0.3 m; −0.4 rad<qψ<0.4 rad; −0.25 rad<qφ<0.7 rad

    • 2) minimal distance between feet. For safety, foot collisions need to be avoided which can hurt the human operator and damage the robot. For instance the following formula can be applied:
      • D(CHr, CHl)>0.02, where CHr,l are the convex hulls of the right and left footprints respectively and D is the Euclidean distance.

    • 3) range of positions and/or velocities of the actuators. Hitting the mechanical bounds of an actuator i with a large velocity leads to destructive damage, i.e. the following formula can be applied:
      • |{dot over (q)}i|<0.6 rad/s or qi−<qi<qi+
      • It is critical to define the extremal positions qi− and qi+ depending on the velocity, as otherwise the problem is too strictly constrained.



    • 4) maximum difference with an expected trajectory, i.e. reference dynamics. As explained, the exoskeleton might already be following a reference trajectory defined by a function noted {circumflex over (q)}m(t) (the expected positions of the actuators over the trajectory, which is equal to 0 in case of a standstill posture), and the odometry change has to be restricted for the pelvis position in the world plane, noted pb=qx,y,φ. This avoids long-term divergence when the exoskeleton is already able to track an expected reference trajectory. For instance the following formula can be applied:
      • |Δqx,y,φ−Δ{circumflex over (q)}x,y,φ|<[2., 3., π/2], where Δ⋆=⋆(t)−⋆(t−ΔT) with ΔT=20 s.

      • Note that in the following specification, each quantity noted with a {circumflex over ( )} will be understood as this quantity in the reference trajectory. For example, {circumflex over (p)}b(t) is the expected pelvis position of the reference trajectory.



    • 5) maximum recovery duration, i.e. “transient dynamics”: To handle strong pushes, large deviations of actuator positions shall be allowed, provided the exoskeleton comes back quickly to the reference trajectory afterwards. For instance the following formula can be applied:
      • ∃t′∈[t−ΔT, t] such that ∥qm(t′)−{circumflex over (q)}m(t′)∥2<0.3, with ΔT=4 s.

    • 6) maximum power consumption: the power consumption is to be limited to fit the hardware specification, for instance at 3 kW. A sketch gathering these terminal-condition checks is given just below.
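
Purely by way of illustration, the following sketch (in Python) gathers the numerical terminal conditions listed above into a single check. The state accessors (pelvis_height, feet_distance, etc.) are hypothetical names, not part of the disclosure, and conditions 4) and 5), which require the reference trajectory, are omitted for brevity.

# Illustrative sketch of terminal-condition checks using the thresholds given above.
# All state accessors are hypothetical; a real implementation depends on the simulator.
def is_terminal(state) -> bool:
    """Return True if any safety/hardware terminal condition is violated."""
    # 1) pelvis position/orientation ranges
    if state.pelvis_height <= 0.3:                      # qz > 0.3 m
        return True
    if not (-0.4 < state.pelvis_yaw < 0.4):             # yaw range [rad]
        return True
    if not (-0.25 < state.pelvis_roll < 0.7):           # roll range [rad]
        return True
    # 2) minimal distance between the convex hulls of the footprints
    if state.feet_distance <= 0.02:
        return True
    # 3) actuator ranges: terminal if a mechanical bound is hit with a large velocity
    for q, dq, (q_min, q_max) in zip(state.q_motors, state.dq_motors, state.q_bounds):
        if abs(dq) >= 0.6 and not (q_min < q < q_max):
            return True
    # 6) maximum power consumption (3 kW)
    if state.power_consumption > 3000.0:
        return True
    # Conditions 4) and 5) (reference and transient dynamics) are omitted here.
    return False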





Reward Shaping

As mentioned, the RL maximises a reward representative of a recovery of said virtual twin of the exoskeleton 1 from each push. The reward can be seen as a score calculated at each simulation by the training algorithm, which is sought to be as high as possible. The idea is to have a high reward when the virtual twin of the exoskeleton performs efficient recoveries, i.e. stays balanced, and a low reward otherwise, for example if it falls. To rephrase, the reward increases with the efficiency of the stabilization.


The reward preferably comprises a set of reward components, to obtain a natural behavior that is comfortable for the user and to provide insight into how to keep balance.


The reward can also be used as a means to trigger recovery as late as possible, as it is thought of as the last resort emergency strategy.


Note that, in order to be generic, the reward can also be used for reference gait (i.e. following reference trajectory) training.


In the case of a plurality of reward components, the total reward can be a weighted sum of the individual objectives:


rt=ΣiωiK(ri), where ri is the reward component, ωi its weight and K a kernel function that is meant to scale them equally, such as the Radial Basis Function (RBF) with cutoff parameter κ (i.e. K(ri)=exp(−κri2)∈[0,1]).


The gradient vanishes for both very small and very large values as a side-effect of this scaling. The cutoff parameter κ is used to adjust the operating range of every reward component.
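
As an illustration, the weighted sum with RBF scaling described above may be sketched as follows in Python; the component names, weights and cutoff parameters are placeholders.

# Illustrative sketch of the total reward: r_t = sum_i w_i * K(r_i) with the RBF
# kernel K(r_i) = exp(-kappa * r_i^2). Names, weights and cutoffs are placeholders.
import numpy as np

def rbf_kernel(error: float, cutoff: float) -> float:
    """Radial Basis Function scaling a non-negative error into [0, 1]."""
    return float(np.exp(-cutoff * error ** 2))

def total_reward(errors: dict, weights: dict, cutoffs: dict) -> float:
    """Weighted sum of individually scaled reward components."""
    return sum(weights[name] * rbf_kernel(err, cutoffs[name])
               for name, err in errors.items())

# Example usage with placeholder values (a smaller error yields a higher reward):
errors = {"odometry": 0.05, "reference_configuration": 0.10, "pelvis_momentum": 0.20}
weights = {"odometry": 1.0, "reference_configuration": 1.0, "pelvis_momentum": 0.5}
cutoffs = {"odometry": 10.0, "reference_configuration": 5.0, "pelvis_momentum": 2.0}
r_t = total_reward(errors, weights, cutoffs)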


There are many possible rewards (or reward components) representative of a recovery of said virtual twin of the exoskeleton 1 from each push, including naïve ones such as simply counting the number of seconds before the exoskeleton falls. The skilled person will not be limited to any specific choice of reward(s), as long as it assesses the ability of the virtual twin of the exoskeleton 1 to recover from each push.


We note that, in the example of the RBF, K has a maximum when ri=0, i.e. it is decreasing on ℝ+. This means that the reward components shall in this case be chosen as decreasing with respect to the ability to recover from each push (i.e. the better the recovery, the lower the reward component value, as it formulates an error), which will be the case in the following examples; the skilled person can nevertheless use differently constructed reward components which are increasing functions, with a suitable increasing K function.


In the preferred example, there may be 3 classes of rewards (a sketch illustrating some of the corresponding components is given after the list):

    • 1) Reference Dynamics. A set of independent high-level features can be extracted to define reward components for each of them, so as to promote settling into a comfortable and stable resting pose for push recovery.
      • Odometry. As explained, regarding the safety of push recovery steps, odometry change has to be as limited as possible for the pelvis position pb, i.e. the reward component could be ∥pb−{circumflex over (p)}b(t)∥2.
      • Reference configuration. The actuator positions shall follow the reference trajectory as closely as possible, i.e. the reward component could be ∥qm−{circumflex over (q)}m(t)∥2.
      • Foot positions and orientations. When recovering, the feet should be flat on the ground at a suitable distance from each other. Without promoting it specifically, the policy learns to spread the legs, which is both suboptimal and unnatural. One has to work in a symmetric, local frame, to ensure this reward is decoupled from the pelvis state, which would otherwise lead to unwanted side effects. If the “mean foot yaw” is introduced as φ=(pφl+pφr)/2 (with pφl and pφr the respective yaws of the left and right feet) and then the relative position of the feet pr−l=Rφ(pr−pl), rewards for the foot position and orientation could be defined as ∥(px,yr−l−{circumflex over (p)}x,yr−l,pzr−l)∥, ∥(pψ,vr,l,pψ,vr,l−{circumflex over (p)}φr,l)∥.
    • 2) Transient Dynamics. Following a strong perturbation, recovery steps are executed to prevent falling in any case.
      • Foot placement. The idea is to act as soon as the Center of Pressure (CP) goes outside the support polygon, by encouraging moving the pelvis toward the CP to get it back under the feet, with for instance the reward component ∥pcp−{circumflex over (p)}cp(t)∥2, where pcp is the relative CP position in the odometry frame.
      • Dynamic stability. The ZMP should be kept inside the support polygon for dynamic stability, with for instance the reward component ∥pzmp−pSP∥, where pSP is the center of the projected support polygon (instead of the real support polygon). It anticipates future impacts with the ground and is agnostic to the contact states.
    • 3) Safety and Comfort: Safety needs to be guaranteed during operation of the exoskeleton 1. Comfort has the same priority for a medical exoskeleton to enhance rehabilitation.
      • Balanced contact forces. Distributing the weight evenly on both feet is key to natural standing, with for instance the reward component ∥Fzrδr+Fzlδl−mg∥, where δr, δl∈{0,1} are the right and left contact states and Fzr, Fzl are the vertical contact forces.
      • Ground friction. Reducing the tangential contact forces limits internal constraints in the mechanical structure that could lead to overheating and discomfort. Moreover, exploiting the friction too much may lead to unrealistic behavior, hence the reward component could be ∥Fx,yl,r∥2, where Fx,yl,r are the x and y tangential forces of the left and right feet.
      • Pelvis momentum. Large pelvis motion is unpleasant. Besides, reducing the angular momentum helps to keep balance, hence the reward component could be ∥{dot over (q)}ψ,v,φ−{circumflex over ({dot over (q)})}ψ,v,φ∥2.
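
As announced above, the following sketch (in Python) illustrates the computation of a few of the listed error terms (odometry, reference configuration, balanced contact forces) before they are scaled by the kernel K; all argument names are hypothetical and the formulas merely transcribe the expressions given in the list.

# Illustrative computation of a few reward-component errors listed above. Each
# returned value would then be passed through the RBF kernel shown earlier.
import numpy as np

def odometry_error(p_b: np.ndarray, p_b_ref: np.ndarray) -> float:
    """||p_b - p_b_ref||_2: deviation of the pelvis odometry from the reference."""
    return float(np.linalg.norm(p_b - p_b_ref))

def reference_configuration_error(q_m: np.ndarray, q_m_ref: np.ndarray) -> float:
    """||q_m - q_m_ref||_2: deviation of the actuator positions from the reference."""
    return float(np.linalg.norm(q_m - q_m_ref))

def contact_balance_error(f_z_right: float, f_z_left: float,
                          contact_right: int, contact_left: int,
                          weight: float) -> float:
    """||F_z^r*d_r + F_z^l*d_l - m*g||: even weight distribution on both feet."""
    return abs(f_z_right * contact_right + f_z_left * contact_left - weight)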


At the end of step (b), a neural network able to command said actuators of the exoskeleton 1 to recover from real life pushes is obtained.


Transfer to Real World

In a following step (c), the trained neural network is stored in a memory 12′ of the exoskeleton 1, typically from a memory 12 of the server 10. This is called “transfer to reality”, as up to now the neural network has only recovered from pushes in simulation; once stored in the memory 12′, the neural network is expected to provide commands to the real actuators of the real exoskeleton 1, so as to perform real recovery of the exoskeleton 1 from unexpected pushes.


Thus, in a second aspect, is proposed the method for stabilizing the exoskeleton 1 accommodating a human operator and presenting a plurality of degrees of freedom actuated by actuators, comprising the step (d) of providing commands to said actuators of the exoskeleton 1 with the neural network trained using the method according to the first aspect.


As explained the neural network can be generated long before, and be transferred to a large number of exoskeletons 1.


Said second method (stabilizing method) can be performed numerous times during the life of the exoskeleton 1, and does not require new transfers, even if the neural network could be updated if an improved policy becomes available.


Policy Regularization

In many prior art attempts, the reality gap has mostly been overlooked and the simulated results hardly transfer to real hardware. Either the transfer is unsuccessful in practice because the physics is over-simplified and hardware limitations are ignored, or regularity is not guaranteed and unexpected hazardous motions can occur.


The present training method enables safe and predictable behavior, which is critical for autonomous systems evolving in a human environment such as bipedal exoskeletons.


In the special case of a medical exoskeleton, comfort and smoothness are even more critical. Vibrations can cause anxiety and, more importantly, lead to injuries over time since patients have fragile bones.


Therefore, step (b) preferably comprises performing temporal and/or spatial regularization of the policy, so as to improve smoothness of the commands of the neural network. It is also called “conditioning” of the policy.


Usually in RL, smoothness can be promoted by adding regularizers as reward components, such as motor torque, motor velocity or power consumption minimization. However, components in the reward function have no guarantee of being optimized, because they make only a minor contribution to the actual loss function of RL learning algorithms.


By contrast, injecting the regularization as extra terms in the loss function directly gives control over how much it is enforced during the learning, see for instance the document S. Mysore, B. El Mabsout, R. Mancuso, and K. Saenko, “Regularizing action policies for smooth control with reinforcement learning,” 12 2020.


In the preferred embodiment, temporal and spatial regularization is used to promote smoothness of the learned state-to-action mappings of the neural network, with for instance the following terms:

    • temporal regularization term LT(θ)=∥πθ(st)−πθ(st+1)∥1
    • spatial regularization term LS(θ)=∥πθ(st)−πθ({tilde over (s)}t)∥22, where {tilde over (s)}t˜N(st, σS), i.e. a sampling function assuming a normal distribution of standard deviation σS centred on st.


These terms can be added to the objective function, possibly with weights, i.e. in the case of PPO: L(θ)=LPPO(θ)+λTLT(θ)+λSLS(θ).
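
As an illustration, a possible sketch (in Python, using the PyTorch library) of these regularization terms and of the combined loss is given below; the weights lambda_t and lambda_s, the standard deviation sigma_s and the name policy_mean (standing for the mean field of the policy discussed hereafter) are placeholders, not values of the disclosure.

# Illustrative sketch of temporal and spatial regularization added to the RL loss.
import torch

def temporal_regularization(policy_mean, s_t, s_next):
    """L_T = ||pi(s_t) - pi(s_t+1)||_1, computed on the mean field of the policy."""
    return torch.norm(policy_mean(s_t) - policy_mean(s_next), p=1)

def spatial_regularization(policy_mean, s_t, sigma_s: float):
    """L_S = ||pi(s_t) - pi(s_t~)||_2^2 with s_t~ drawn from N(s_t, sigma_s)."""
    s_noisy = s_t + sigma_s * torch.randn_like(s_t)
    return torch.sum((policy_mean(s_t) - policy_mean(s_noisy)) ** 2)

def regularized_loss(loss_rl, policy_mean, s_t, s_next,
                     lambda_t: float = 0.1, lambda_s: float = 0.1,
                     sigma_s: float = 0.3):
    """L = L_RL + lambda_T * L_T + lambda_S * L_S (weights are placeholders)."""
    return (loss_rl
            + lambda_t * temporal_regularization(policy_mean, s_t, s_next)
            + lambda_s * spatial_regularization(policy_mean, s_t, sigma_s))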


σS is based on the expected measurement noise and/or tolerance, which limits its role to robustness concerns. However, its true power is unveiled when smoothness is further used to shape and enforce regularity in the behavior of the policy.


By choosing the proper standard deviation, in addition to robustness, a minimal but efficient set of recovery strategies can be learnt, and the responsiveness and reactivity of the policy on the real device can be adjusted. To that end, σS is typically comprised between 0.1 and 0.7.


A further possible improvement is the introduction of the L1-norm in the temporal regularization. It still ensures that the policy reacts only if necessary and recovers as fast as possible. Yet, it also avoids penalizing too strictly the peaks that would be beneficial to withstand some pushes, i.e. it avoids smoothing out fast, very dynamic motions.


Finally, it is more appropriate to use the mean field of the policy in the regularization instead of the actual sampled policy outputs πθ. It still provides the same gradient on the network weights θ, but it is now independent of the exploration, which avoids penalizing it.


Results

With an episode duration limited to 60 s, an episode corresponds to T=1500 time steps. In practice, 100M iterations are necessary for asymptotic optimality under worst-case conditions, corresponding to roughly one and a half months of experience on a real exoskeleton 1. Obtaining a satisfying and transferable policy takes about 6 h, using 40 independent simulation workers on a single machine with 64 physical cores and 1 Tesla V100 GPU.


The training curves of the average episode reward and duration show the impact of the main contributions:

    • Smoothness conditioning (regularization) slightly slows down the convergence (60M vs 50M), but does not impede the asymptotic reward.
    • Using a simplistic reward, namely +1 per time step until early termination (fall), similar training performance can be observed until 25M iterations, thanks to the well-defined terminal conditions. After this point, the convergence gets slower, but the policy is only slightly under-performing at the end. This result validates the convergence robustness and shows that the general idea of using a reward representative of a recovery of said virtual twin of the exoskeleton 1 from each push provides insight into how to recover balance.
    • Without the terminal conditions for safety and transferability, fast convergence in around 30M can be achieved. However, it would not be possible to use such policy on the real device.


Further tests show that smoothness conditioning improves the learned behaviour, cancels harmful vibrations and preserves very dynamic motions. Moreover, it also recovers balance more efficiently, by taking shorter and minimal actions.


Finally, the trained neural network has been evaluated for both a user and a dummy on several real Atalante units. Contrary to the learning scenario with only pushes at the pelvis center, the policy can handle many types of external disturbances: the exoskeleton has been pushed in reality at several different application points and impressive results are obtained. The recovery strategies are reliable for all push variations and even pulling. The transfer to Atalante works out-of-the-box despite wear of the hardware and discrepancies with the simulation, notably ground friction, mechanical flexibility and patient disturbances.


System

According to another aspect, the disclosure relates to the system of the server 10 and the bipedal robot 1 (preferably an exoskeleton accommodating a human operator), for the implementation of the method according to the first aspect (training the neural network) and/or the second aspect (stabilizing the robot).


As explained, each of the server 10 and the robot 1 comprises data processing means 11, 11′ and data storage means 12, 12′ (optionally external). The robot 1 generally also comprises sensors such as inertial measurement means 14 (inertial unit) and/or means for detecting the impact of the feet on the ground 13 (contact sensors or optionally pressure sensors).


The robot 1 has a plurality of degrees of freedom, each actuated by an actuator controlled by the data processing means 11′.


The data processing means 11 of the server 10 are configured to implement the method for training a neural network for stabilizing the robot 1 according to the first aspect.


The data processing means 11′ of the robot 1 are configured to implement the method for stabilizing the robot 1 using the neural network according to the second aspect.


Computer Programme Product

According to fourth and fifth aspects, the disclosure relates to a computer program product comprising code instructions for the execution (on the processing means 11, 11′) of the method for training a neural network for stabilizing the robot 1 according to the first aspect or the method for stabilizing the robot 1 according to the second aspect, as well as storage means readable by a computer equipment (for example the data storage means 12, 12′) on which this computer program product is found.

Claims
  • 1. A method for training a neural network for stabilizing a bipedal robot comprising the implementation by a data processing means of a server of steps of: (a) applying, in a simulation, a sequence of pushes on a virtual twin of a bipedal robot presenting a plurality of degrees of freedom actuated by actuators; and(b) performing a reinforcement learning algorithm on said neural network, wherein the neural network provides commands to actuators of the virtual twin of the bipedal robot so as to maximise a reward representative of a recovery of said virtual twin of the bipedal robot from each push.
  • 2. The method according to claim 1, wherein said simulation lasts for a predetermined duration, steps (a) and (b) being repeated for a plurality of simulations.
  • 3. The method according to claim 2, wherein pushes are applied periodically over the predetermined duration of the simulation, with forces of constant magnitude applied for a predefined duration.
  • 4. The method according to claim 1, wherein pushes are applied on a pelvis of the virtual twin of the bipedal robot with an orientation sampled from a spherical distribution.
  • 5. The method according to claim 1, wherein at least one terminal condition on the virtual twin of the bipedal robot is enforced during step (b).
  • 6. The method according to claim 5, wherein said at least one terminal condition includes at least one of: a minimal distance between feet of the bipedal robot, a range of positions of the actuators, a range of velocities of the actuators, a maximum difference with an expected trajectory, a maximum recovery duration, a maximum power consumption.
  • 7. The method according to claim 6, wherein the minimal distance between feet of the bipedal robot complies with the following formula: D(CHr, CHl)>0.02, where CHr,l is a convex hull of a right footprint and a left footprint, respectively, and D is a Euclidean distance.
  • 8. The method according to claim 6, wherein the range of velocities of the actuators consists of a range of velocities that are below 0.6 rad/s.
  • 9. The method according to claim 5, wherein said at least one terminal condition includes a minimal height of the pelvis which is above 0.3 m, minimal and maximal yaw angles of the pelvis comprised between −0.4 and 0.4 rad, and minimal and maximal roll angles of the pelvis comprised between −0.25 and 0.7 rad.
  • 10. The method according to claim 1, wherein said simulation outputs a state of the virtual twin of the bipedal robot as a function of the pushes and the commands provided by the neural network.
  • 11. The method according to claim 10, wherein the bipedal robot comprises at least one sensor for observing a state of the bipedal robot, wherein the neural network takes as input in step (b) the state of the virtual twin of the robot as outputted by the simulation.
  • 12. The method according to claim 1, wherein the neural network provides as commands target positions and/or velocities of the actuators, and a control loop mechanism determines torques to be applied by the actuators as a function of said target positions and/or velocities of the actuators.
  • 13. The method according to claim 12, wherein the neural network provides commands at a first frequency, and the control loop mechanism provides torques at a second frequency which is higher than the first frequency.
  • 14. The method according to claim 1, wherein the neural network is trained to learn a policy, step (b) comprising performing temporal and/or spatial regularization of the policy, so as to improve smoothness of the commands of the neural network.
  • 15. The method according to claim 14, wherein step (b) includes performing spatial regularization using a spatial regularization term which is a function of a distance between a first value of the policy for a first state and a second value of the policy for a second, similar state the similar state being a result of a sampling function assuming a normal distribution of standard deviation centred on the first state.
  • 16. The method according to claim 15, wherein the standard deviation is comprised between 0.1 and 0.7.
  • 17. The method according to claim 14, wherein step (b) includes performing temporal regularization using a temporal regularization term which is a function of a distance between a first value of the policy for a first state and a second value of the policy for a second, subsequent state.
  • 18. The method according to claim 17, wherein the distance uses a L1-norm.
  • 19. The method according to claim 15, wherein the first value consists of a mean value of the policy for the first state, the second value consisting of a mean value of the policy for the second state.
  • 20. The method according to claim 14, wherein the regularization is applied to a mean field of the policy.
  • 21. The method according to claim 1, comprising a step (c) of storing the trained neural network in a memory of the bipedal robot.
  • 22. The method according to claim 1, wherein the bipedal robot is an exoskeleton accommodating a human operator.
  • 23. A method for stabilizing a bipedal robot comprising providing commands to actuators of a bipedal robot presenting a plurality of degrees of freedom actuated by said actuators with a neural network trained using the method according to claim 1.
  • 24. A system comprising a server and a bipedal robot presenting a plurality of degrees of freedom actuated by actuators, each comprising data processing means, wherein said data processing means are respectively configured to implement the method for training a neural network for stabilizing the bipedal robot according to claim 1 and the method for stabilizing the exoskeleton according to claim 23.
  • 25. A computer program product comprising code instructions for executing the method for training a neural network for stabilizing the bipedal robot according to claim 1 or the method for stabilizing the robot according to claim 23, when said program is run on a computer.
  • 26. Storage means readable by a computer equipment on which a computer program product comprises code instructions for executing the method for training a neural network for stabilizing the bipedal robot according to claim 1 or the method for stabilizing the bipedal robot according to claim 23.
CROSS REFERENCE TO RELATED APPLICATIONS

This application is the 35 U.S.C. § 371 national stage application of PCT Application No. PCT/EP2023/054307, filed Feb. 21, 2023, which application claims the benefit of European Application No. EP 22305215.0 filed Feb. 25, 2022, both of which are hereby incorporated by reference herein in their entireties.
