Disclosed embodiments relate generally to the field of industrial automation and control, and, more particularly, to control techniques involving an adaptively weighted combination of reinforcement learning and conventional feedback control techniques, and, even more particularly, to a robotics control system and method suitable for industrial reinforcement learning.
Conventional feedback control techniques (which may be referred to throughout this disclosure as “conventional control”) can solve various types of control problems, such as, without limitation, robotics control, autonomous industrial automation, etc. Conventional control generally accomplishes this very efficiently by capturing an underlying physical structure with explicit models. In one example application, this could involve an explicit definition of the rigid-body equations of motion that may be involved in controlling a trajectory of a given robot. It will be appreciated, however, that many control problems in modern manufacturing can involve various physical interactions with objects, such as may involve, without limitation, contacts, impacts, and/or friction with one or more of the objects. These physical interactions tend to be more difficult to capture with a first-order physical model. Hence, applying conventional control techniques to these situations often results in brittle and inaccurate controllers, which, for example, have to be manually tuned for deployment. This adds to costs and can increase the time involved for robot deployment.
Reinforcement learning (RL) techniques have been demonstrated to be capable of learning continuous robot controllers involving interactions with the physical environment. However, a disadvantage commonly encountered in RL techniques, particularly deep RL techniques with very expressive function approximators, is the burdensome and time-consuming exploratory behavior and the substantial sample inefficiency that may be involved, as is generally the case when learning a control policy from scratch.
For an example of control techniques that may decompose the overall control strategy into a control part that is solved by conventional control techniques, and a residual control part, which is solved with RL, see the following technical papers, respectively titled: “Residual Reinforcement Learning for Robot Control” by T. Johannink; S. Bahl; A. Nair; J. Luo; A. Kumar; M. Loskyll; J. Aparicio Ojea; E. Solowjow; and S. Levine, published in arXiv:1812.03201v2 [cs.RO], 18 Dec. 2018; and “Residual Policy Learning” by T. Silver; K. Allen; J. Tenenbaum; and L. Kaelbling, published in arXiv:1812.06298v2 [cs.RO], 3 Jan. 2019.
It will be appreciated that the approach described in the above-cited papers may be somewhat limited for broad and cost-effective industrial applicability since, for example, reinforcement learning from scratch tends to remain substantially data-inefficient and/or intractable.
The present inventors have recognized that while the basic idea of combining reinforcement learning (RL) with conventional control seems very promising, prior to the various innovative concepts disclosed in the present disclosure, a practical implementation in an industrial setting has remained elusive, since various nontrivial technical implementation challenges have not been fully resolved in typical prior art implementations. Some of the challenges solved by disclosed embodiments are described below.
At least in view of the foregoing considerations, disclosed embodiments realize appropriate improvements in connection with certain known approaches involving RL (see, for example, the two technical papers cited above). It is believed that disclosed embodiments will enable practical and cost-effective industrial deployment of RL integrated with conventional control. The disclosed control approach may be referred to throughout this disclosure as Industrial Residual Reinforcement Learning (IRRL).
The present inventors propose various innovative technical features to substantially improve at least certain known approaches involving RL. The following two disclosed non-limiting concepts, indicated as concept I) and concept II), underlie IRRL:
Concept I)
In a conventional residual RL technique, a hand-designed controller may involve a rigid control strategy and, consequently, may not be able to easily adapt to a dynamically changing environment, which, as would be appreciated by one skilled in the art, is a substantial drawback to effective operation in such an environment. For example, in an object insertion application that may involve randomly positioned objects, the conventional controller may be a position controller. The residual RL control part may then augment the controller for overall performance improvement. If the position controller, for example, performs a given insertion too fast (e.g., the insertion velocity is too high), the residual RL part may not be able to timely assert any meaningful influence; for example, it may not be able to dynamically change the position controller. Instead, in a practical application, the residual control part should be able to appropriately influence (e.g., beneficially oppose) the control signal generated by the conventional controller. For example, if the velocity developed by the position controller is too high, then the residual RL part should be able to influence the control signal generated by the conventional controller to reduce such high velocity. To solve this fundamental problem, the present inventors propose an adaptive interaction between the respective control signals generated by the classic controller and the RL controller. In principle, on the one hand, the initial conventional controller should be a guiding part and not an opponent to the RL part, and, on the other hand, the RL part should be able to appropriately adapt the conventional controller.
The disclosed adaptive interaction may be as outlined below. First, the respective control signals from the two control strategies (i.e., the conventional control and the RL control) may be compared in terms of their orthogonality, such as by computing their inner product. Signal contributions toward a same projected control “direction” may be punished in a reward function. This avoids the two control parts “fighting” each other. At the same time, a disclosed algorithm can monitor whether the residual RL part has components that try to fight the conventional controller, which may be an indication of inadequacies of the conventional controller for performing a given control task. This indication may then be used to modify the conventional control law, either automatically or through manual adjustments.
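One possible sketch of this inner-product comparison is given below; the function name, the exact penalty form, and the monitoring flag are illustrative assumptions rather than a definitive implementation of the disclosed embodiments:

```python
import numpy as np

def orthogonality_terms(u_conv, u_rl):
    """Compare the conventional and residual-RL control signals via
    their inner product.

    Returns a reward penalty for contributions toward the same
    projected control direction, together with a flag indicating that
    the RL part opposes ("fights") the conventional controller. The
    names and the exact penalty form are illustrative assumptions.
    """
    inner = float(np.dot(u_conv, u_rl))
    # Punish same-direction contributions in the reward function.
    penalty = -max(inner, 0.0)
    # A negative inner product suggests the residual part is fighting
    # the conventional controller, hinting that the conventional
    # control law itself may need (automatic or manual) modification.
    fighting = inner < 0.0
    return penalty, fighting
```

The `fighting` flag corresponds to the disclosed monitoring of residual components that oppose the conventional controller, and could be used to trigger an adjustment of the conventional control law.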
Second, instead of the constant weighting commonly used in a conventional residual RL control strategy, the present inventors innovatively propose adjustable weights. Without limitation, the weight adjustment may be controlled by the respective contributions of the control signals toward fulfilling the reward function; the weights thus become functions of the rewards. This should enable very efficient learning and smooth execution. The RL control part may be guided depending on how well it has already learned. The rationale is that as soon as the RL control part is at least on par with the initial hand-designed controller, the hand-designed controller is in principle no longer required and can be partially turned off. However, the initial hand-designed controller will still be able to contribute a control signal whenever the RL control part delivers an inferior performance for a given control task. This blending is gracefully accommodated by the adjustable weights. An analogous, simplified concept would be “bicycle training wheels”, which may be essential during learning but can still provide support after the learning is finished, at least during challenging situations, e.g., riding too fast when taking a sharp turn.
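One way to realize such reward-dependent weights is sketched below. The softmax form and the temperature parameter are illustrative assumptions; the disclosure only requires that the weights be functions of the respective reward contributions:

```python
import numpy as np

def blended_action(u_conv, u_rl, r_conv, r_rl, temperature=1.0):
    """Blend the two control signals with weights that depend on each
    part's recent contribution toward fulfilling the reward function.

    r_conv, r_rl: scalar reward contributions attributed to the
    conventional and RL control parts (how they are attributed is an
    assumption left open here).
    """
    logits = np.array([r_conv, r_rl]) / temperature
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w = w / w.sum()                     # the two weights sum to one
    # As the RL part's reward contribution grows, its weight grows and
    # the hand-designed controller is gradually "turned off", while
    # still being available whenever the RL part underperforms.
    return w[0] * u_conv + w[1] * u_rl, w
```

With equal reward contributions the two controllers share authority equally; as one part dominates the reward, its weight approaches one, which mirrors the “training wheels” analogy above.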
Concept II)
Known approaches for training residual RL in simulation generally suffer from hit-or-miss drawbacks, mainly because the simulation is generally set up a priori. Typically, the control policy may be trained solely in a simulation environment, and only afterwards is the control policy deployed in a real-world environment. Accordingly, the actual performance of a control policy trained solely in the simulation environment would not be self-evident until deployed in the real world.
Accordingly, the present inventors further propose an iterative approach, as seen in
Additionally, in the iterative approach proposed in disclosed embodiments, the simulation environment may be continuously adjusted based on real-world experience. In known approaches, as noted above, training in simulation is generally run until the simulated training is finished, and the trained control policy is then transferred to a physical robot in a robot roll-out. Instead, disclosed embodiments effectively interleave simulated experience and real-world experience to, for example, ensure that the simulated experience iteratively improves in quality, in a time-efficient manner, and sufficiently converges toward the real-world experience. For example, a friction coefficient used in the simulation may be adjusted based on real-world measurements, rendering virtual experiments more useful because the virtual experiments would become closer to mimicking the physics involved in a real-world task being performed by the robot, such as automated object insertions by the robot.
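As a minimal sketch of such an adjustment (the relaxation update, the `step` factor, and all names are illustrative assumptions, not part of the disclosed embodiments), a measured friction coefficient could be folded into the simulation as follows:

```python
def adjust_friction(sim_mu, real_measurements, step=0.5):
    """Relax the simulated friction coefficient toward the value
    implied by real-world measurements.

    sim_mu: friction coefficient currently used in the simulation.
    real_measurements: friction estimates from real-world experiments.
    step: assumed relaxation factor (0 < step <= 1).
    """
    real_mu = sum(real_measurements) / len(real_measurements)
    # Move part of the way toward the measured value rather than
    # overwriting it, so that noisy measurements do not destabilize
    # the ongoing simulated training.
    return sim_mu + step * (real_mu - sim_mu)
```

A full overwrite (`step=1.0`) is not necessarily desirable; as noted elsewhere in this disclosure, the simulation parameters need not precisely converge to the real-world values so long as the learning objectives are achieved in a time-efficient manner.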
It is noted that in a practical application, simulation adjustments need not necessarily be configured for making a given simulation more realistic, but rather may be configured for achieving accelerated (time-efficient) learning. Accordingly, the physical parameters involved in a given simulation do not necessarily have to precisely converge towards the real-world parameters so long as the learning objectives may be achieved in a time-efficient manner.
The disclosed approach is an appropriately balanced way to rapidly close a simulation-to-reality gap in RL. Moreover, the disclosed approach can allow for making educated improvements to physical effects in the simulation and for quantifying them in terms of their relevance for control policy performance and improvement. For example: “How relevant is it, in a given application, to simulate electromagnetic forces that can develop between two objects?” The point is that one would not want to allocate valuable simulation resources to non-relevant parameters.
It will be appreciated that bringing simulation and the real world closer together may allow the sensor modalities involved in a given application to be appropriately tailored. Without limitation, the disclosed approach can make evaluations about the physical environment; for example, evaluations about how accurate and/or sensitive a given sensor and/or actuator needs to be for appropriately fulfilling a desired control policy objective, or whether additional sensors and/or actuators need to be added (or whether different sensor modalities and/or actuator modalities need to be used). Without limitation, for example, the disclosed approach could additionally recommend respective locations where such additional sensors and/or actuators should be installed.
In the following detailed description, various specific details are set forth in order to provide a thorough understanding of such embodiments. However, those skilled in the art will understand that disclosed embodiments may be practiced without these specific details, that the aspects of the present invention are not limited to the disclosed embodiments, and that aspects of the present invention may be practiced in a variety of alternative embodiments. In other instances, methods, procedures, and components that would be well understood by one skilled in the art have not been described in detail to avoid unnecessary and burdensome explanation.
Furthermore, various operations may be described as multiple discrete steps performed in a manner that is helpful for understanding embodiments of the present invention. However, the order of description should not be construed to imply that these operations need be performed in the order they are presented, nor that they are even order dependent, unless otherwise indicated. Moreover, repeated usage of the phrase “in one embodiment” does not necessarily refer to the same embodiment, although it may. It is noted that disclosed embodiments need not be construed as mutually exclusive embodiments, since aspects of such disclosed embodiments may be appropriately combined by one skilled in the art depending on the needs of a given application.
Without limitation, controller 16 may include a conventional feedback controller 18 configured to generate a conventional feedback control signal 20, and a reinforcement learning controller 22 configured to generate a reinforcement learning control signal 24.
A comparator 25 may be configured to compare orthogonality of conventional feedback control signal 20 and reinforcement learning control signal 24. Comparator 25 may be configured to supply a signal 26 indicative of orthogonality relations between conventional feedback control signal 20 and the reinforcement learning control signal 24.
Reinforcement learning controller 22 may include a reward function 28 responsive to the signal 26 indicative of the orthogonality relations between conventional feedback control signal 20 and reinforcement learning control signal 24. In one non-limiting embodiment, the orthogonality relations between conventional feedback control signal 20 and reinforcement learning control signal 24 may be determined based on an inner product of conventional feedback control signal 20 and reinforcement learning control signal 24.
In one non-limiting embodiment, orthogonality relations indicative of interdependency of conventional feedback control signal 20 and reinforcement learning control signal 24 are penalized by reward function 28 so that control conflicts between conventional feedback controller 18 and reinforcement learning controller 22 are avoided.
In one non-limiting embodiment, reward function 28 of reinforcement learning controller 22 may be configured to generate a stream of adaptive weights 30 based on respective contributions of conventional feedback control signal 20 and of reinforcement learning control signal 24 towards fulfilling reward function 28.
In one non-limiting embodiment, a signal combiner 32 may be configured to adaptively combine conventional feedback control signal 20 and reinforcement learning control signal 24 based on the stream of adaptive weights 30 generated by reward function 28. Without limitation, signal combiner 32 may be configured to supply an adaptively combined control signal 34 of conventional feedback control signal 20 and reinforcement learning control signal 24. The adaptively combined control signal 34 may be configured to control robot 14, as the robot performs a sequence of tasks.
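The signal path described above (comparator 25, reward function 28, adaptive weights 30, and signal combiner 32) could be sketched for a single control cycle as follows; the sigmoid weight mapping and the specific penalty term are illustrative assumptions, not a definitive realization of the disclosed components:

```python
import numpy as np

def control_step(u_conv, u_rl, task_reward):
    """One pass through the disclosed signal path for a single control
    cycle; all numerical choices here are illustrative assumptions."""
    # Comparator 25: inner product of the two control signals
    # (signal 26 indicative of their orthogonality relations).
    inner = float(np.dot(u_conv, u_rl))
    # Reward function 28: penalize interdependent (non-orthogonal)
    # signal components to avoid control conflicts.
    reward = task_reward - abs(inner)
    # Adaptive weight 30: an assumed sigmoid mapping from the reward
    # to the share of authority granted to the RL part.
    w_rl = 1.0 / (1.0 + np.exp(-reward))
    # Signal combiner 32: adaptively combined control signal 34.
    u_combined = (1.0 - w_rl) * u_conv + w_rl * u_rl
    return u_combined, w_rl
```

When the two signals are orthogonal and the task reward is neutral, the two controllers share authority equally; interdependent components reduce the reward and thus shift authority back toward the conventional controller.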
Controller 16 may be configured to perform a blended control policy for conventional feedback controller 18 and reinforcement learning controller 22 to control robot 14 as the robot performs the sequence of tasks. Without limitation, the blended control policy may include robotic control modes, such as including trajectory control and interactive control of robot 14. By way of example, the interactive control of the robot may include interactions, such as may involve frictional, contact and impact interactions, that, for example, may be experienced by joints (e.g., grippers) of the robot while performing a respective task of the sequence of tasks.
Block 104 allows acquiring real-world sensor and actuator data (block 54, (
Block 106 allows extracting statistical properties of the acquired real-world sensor and actuator data. See also block 56 in
Block 108 allows extracting statistical properties of the virtual sensor and actuator data in the simulation environment. See also block 62 in
Block 110 allows adjusting—e.g., in a feedback loop 64 (
Block 112 allows applying the adjusted simulation environment to further train the baseline control policy. This would be a first iteration that may be performed in block 50 in
As indicated in block 114, based on whether or not the updated control policy fulfills desired objectives, further iterations may be performed in feedback loop 64 (
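Structurally, blocks 104 through 114 can be sketched as the following loop. The `SimEnv` class, the scalar “statistics”, and the convergence test are hypothetical stand-ins used only to show the interleaving of simulated and real-world experience; they do not represent an actual simulator API:

```python
class SimEnv:
    """Hypothetical stand-in for the simulation environment."""
    def __init__(self, friction=0.2):
        self.friction = friction

    def train(self, policy):
        # Further train the baseline control policy (block 112);
        # incrementing a counter stands in for actual training.
        return policy + 1

    def stats(self):
        # Statistical properties of virtual sensor/actuator data
        # (block 108), reduced here to a single scalar.
        return self.friction

    def adjust(self, real_stats, step=0.5):
        # Feedback-loop adjustment of the simulation (block 110).
        self.friction += step * (real_stats - self.friction)

def irrl_iterations(policy, sim, real_stats, n_iters=10, tol=0.01):
    """Interleave simulated training with real-world measurement
    (blocks 104-114) until the simulated statistics approach the
    real-world ones; all names and the convergence criterion are
    illustrative assumptions."""
    for _ in range(n_iters):
        policy = sim.train(policy)            # block 112
        sim.adjust(real_stats)                # blocks 104-110, loop 64
        if abs(sim.stats() - real_stats) < tol:
            break                             # block 114: objectives met
    return policy, sim

policy, sim = irrl_iterations(policy=0, sim=SimEnv(0.2), real_stats=0.5)
```

Each pass shrinks the gap between simulated and real-world statistics while the policy continues to train, which is the essence of the disclosed interleaving.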
The description below will proceed to describe further non-limiting aspects that may be performed in connection with the disclosed methodology for training disclosed robotics control system 10.
As illustrated in block 120 in
As illustrated in block 140 in
As illustrated in block 160 in
For example, this may allow the real-world sensor and/or actuator modalities involved in a given application to be appropriately tailored. Without limitation, the disclosed approach can make evaluations about how accurate and/or sensitive a given sensor and/or a given actuator needs to be for appropriately fulfilling a desired control policy objective, or, for example, whether additional sensors and/or additional actuators need to be added (or whether different sensor modalities and/or different actuator modalities need to be used). Without limitation, for example, the disclosed approach could additionally recommend respective locations where such additional sensors and/or actuators should be installed.
As illustrated in block 180 in
In operation, disclosed embodiments allow cost-effective and reliable deployment of deep learning algorithms, such as involving deep learning RL techniques for autonomous industrial automation that may involve robotics control. Without limitation, disclosed embodiments are effective for carrying out continuous, automated robotics control, such as may involve a blended control policy that may include trajectory control and interactive control of a given robot. By way of example, the interactive control of the robot may include relatively difficult to model interactions, such as may involve frictional, contact and impact interactions, that, for example, may be experienced by joints (e.g., grippers) of the robot while performing a respective task of the sequence of tasks.
Disclosed embodiments are believed to be conducive to widespread and flexible applicability of machine learned networks for industrial automation and control that may involve automated robotics control. For example, the efficacy of disclosed embodiments may be based on an adaptive interaction between the respective control signals generated by a classic controller and an RL controller. Additionally, disclosed embodiments can make use of a machine learned framework that effectively interleaves simulated experience and real-world experience to ensure that the simulated experience iteratively improves in quality and converges towards the real-world experience. Lastly, a systematic interleaving of simulated experience and real-world experience to train a control policy in a simulator is effective to substantially reduce the required sample size compared to prior art training approaches.
While embodiments of the present disclosure have been disclosed in exemplary forms, it will be apparent to those skilled in the art that many modifications, additions, and deletions can be made therein without departing from the scope of the invention and its equivalents, as set forth in the following claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/053839 | 9/30/2019 | WO |