The present disclosure generally relates to a system and method for implementing imitation learning in a manufacturing environment.
Since the dawn of the industrial revolution in the 18th century, automation has governed the production of goods. Although today's factories have fully embraced automation as a core principle—with robots performing many repeatable tasks in high-production environments—many assembly tasks continue to be performed by humans. These tasks are difficult to automate due to cost, risk of critical failure, or logistics of deploying a robotic system to perform a task.
A computing system identifies a trajectory example generated by a human operator. The trajectory example includes trajectory information of the human operator while performing a task to be learned by a control system of the computing system. Based on the trajectory example, the computing system trains the control system to perform the task exemplified in the trajectory example. Training the control system includes generating an output trajectory of a robot performing the task. The computing system identifies an updated trajectory example generated by the human operator based on the trajectory example and the output trajectory of the robot performing the task. Based on the updated trajectory example, the computing system continues to train the control system to perform the task exemplified in the updated trajectory example.
In some embodiments, a non-transitory computer readable medium is disclosed herein. The non-transitory computer readable medium includes one or more sequences of instructions, which, when executed by a processor, cause a computing system to perform operations. The operations include identifying, by the computing system, a trajectory example generated by a human operator. The trajectory example includes trajectory information of the human operator while performing a task to be learned by a control system of the computing system. The operations further include, based on the trajectory example, training, by the computing system, the control system to perform the task exemplified in the trajectory example. Training the control system includes generating an output trajectory of a robot performing the task. The operations further include identifying, by the computing system, an updated trajectory example generated by the human operator based on the trajectory example and the output trajectory of the robot performing the task. The operations further include, based on the updated trajectory example, continuing, by the computing system, to train the control system to perform the task exemplified in the updated trajectory example.
In some embodiments, a system is disclosed herein. The system includes a processor and a memory. The memory has programming instructions stored thereon, which, when executed by the processor, cause the system to perform operations. The operations include identifying a trajectory example generated by a human operator. The trajectory example includes trajectory information of the human operator while performing a task to be learned by a control system. The operations further include, based on the trajectory example, training the control system to perform the task exemplified in the trajectory example. Training the control system includes generating an output trajectory of a robot performing the task. The operations further include identifying an updated trajectory example generated by the human operator based on the trajectory example and the output trajectory of the robot performing the task. The operations further include, based on the updated trajectory example, continuing to train the control system to perform the task exemplified in the updated trajectory example.
So that the manner in which the above recited features of the present disclosure can be understood in detail, a more particular description of the disclosure, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this disclosure and are therefore not to be considered limiting of its scope, for the disclosure may admit to other equally effective embodiments.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements disclosed in one embodiment may be beneficially utilized on other embodiments without specific recitation.
Imitation learning is a growing umbrella of methods for training robotic systems to closely mimic human activity in operation, allowing motion to be encoded faster, more generically, and for existing processes. Imitation learning escapes from the traditional translation step of converting human-based standard operating protocols into robotic programming logic. This translation is imperfect, however, and carries the intrinsic difficulty of identifying the principal motion components in each action frame and determining the intention behind specific motions.
For industrial applications, the action space is narrower than, for example, it is with autonomous driving; however, the need to achieve higher efficiency and yield is greater for industrial applications. In the industrial context, for example, the goal may be to encode robotic motion from existing or actively collected processes, such that the robot can perform the same procedure without the need for human intervention on the production line. Moreover, the goal is to prevent the need for downstream or upstream protocols to be altered or an exception to be added to the product check-out. The process of enabling the robotic system to be a drop-in for human operators may require high fidelity of operation and the accounting for many conditions that might not be intrinsically known. Though simulation environments may be effective in the development of learning strategies, they are ill suited to capture the myriad of variables associated with real-world processes and are of limited use for practical improvement of ground truth procedures.
To account for the current limitations in imitation learning, one or more techniques described herein provide a new approach to the foregoing set of problems by forcing a human operator (e.g., a teacher) to be dynamic in the inherent action space and to adjust their standard operating protocol (SOP) according to the limitations of the robotic operator (e.g., a student). The present approach may use the guidance of the traditional teacher-student relationship, by which the unsuccessful teacher is defined as one who “teaches to themselves” and assumes a set of skills, prior knowledge, and experience that the actual student does not possess. Instead, a successful teacher-student relationship may be defined as the same teacher adjusting their procedures and approach based on the performance of the student and the intuitive understanding of the student's limitations.
In some embodiments, the one or more techniques described herein may maintain the overarching static inputs and outputs of a processing station, as defined by the SOP, and may provide feedback to the human operator to change their approach to the outlined tasks. Fluid actions, such as the angle of approach to grasp a tool, the type of tool used, the order of operations, and other parts of the process that are performed through habit and not as important as strategic items, may be altered without sacrificing efficiency, while also improving the learning rate and fidelity of the robotic operations.
In some embodiments, an operator may iteratively remap their own action space onto the action space of the robotic system. The hybrid active-teaching/active-learning approach may suggest a way to continuously improve the performance of the robotic system concurrently with, and co-dependent upon, continuously adapting the human operator's motion profile to more closely match the intrinsic limitations of the robotic system.
Such an approach may create a dual-reward structure, by which the operator may be rewarded for the completion of their primary task, as measured by adherence and compliance to the SOP, and for the ability of their actions to be understood by the robotic manipulator. By continuously providing feedback from the robotic system in training to the operator, the present approach seeks to apply direct changes to the operator behavior to reduce the environmental noise and classification problems that limit successful robotic training. The present process may be used for the introduction of robotic manipulators into existing production lines for auxiliary operations, improving handover tasks, and retraining for new products and procedures. The present approach does not necessitate robotic replacement of operators, but instead facilitates the efficient use of both resources by streamlining the training process.
Teacher computing system 102 may be operated by a user or administrator. For example, teacher computing system 102 may be a mobile device, a tablet, a desktop computer, or any computing system having the capabilities described herein. More specifically, teacher computing system 102 may be associated with a “teacher” in a teacher-student imitation learning paradigm.
Teacher computing system 102 may include at least data collection module 106 and training module 108. Each of data collection module 106 and training module 108 may comprise one or more software modules. The one or more software modules may be collections of code or instructions stored on a medium (e.g., memory of computing systems associated with teacher computing system 102) that represent a series of machine instructions (e.g., program code) that implement one or more algorithmic steps. Such machine instructions may be the actual computer code the processor interprets to implement the instructions or, alternatively, may be a higher level of coding of the instructions that is interpreted to obtain the actual computer code. The one or more software modules may also include one or more hardware components. One or more aspects of an example algorithm may be performed by the hardware components (e.g., circuitry) themselves, rather than as a result of the instructions.
Data collection module 106 may be configured to receive or retrieve data from one or more sensors 110 in communication with teacher computing system 102. For example, sensors 110 may be configured to monitor a human operator, or teacher, performing a process to be learned by student computing system 104. Using a more specific example, sensors 110 may be configured to monitor a human operator placing a substrate on a pedestal of an inspection system. Sensors 110 may collect various data related to the process, such as, but not limited to, images or videos of the human operator, angles related to the joints of the human operator, motion data related to the human operator, and the like. Generally, sensors 110 may collect data related to a human operator performing the same process multiple times in multiple different ways.
Training module 108 may be configured to train a neural network for deployment to student computing system 104. For example, training module 108 may be configured to generate a teacher policy from which student computing system 104 may learn to perform the target process.
Student computing system 104 may be associated with an agent. In some embodiments, an agent may refer to a robot. Student computing system 104 may be representative of a mobile device, a tablet, a desktop computer, or any computing system having the capabilities described herein. More specifically, student computing system 104 may be associated with a “student” in a teacher-student imitation learning paradigm.
Student computing system 104 may include at least data collection module 116 and control system 118. Each of data collection module 116 and control system 118 may comprise one or more software modules. The one or more software modules may be collections of code or instructions stored on a medium (e.g., memory of computing systems associated with student computing system 104) that represent a series of machine instructions (e.g., program code) that implement one or more algorithmic steps. Such machine instructions may be the actual computer code the processor interprets to implement the instructions or, alternatively, may be a higher level of coding of the instructions that is interpreted to obtain the actual computer code. The one or more software modules may also include one or more hardware components. One or more aspects of an example algorithm may be performed by the hardware components (e.g., circuitry) themselves, rather than as a result of the instructions.
Data collection module 116 may be configured to receive or retrieve data from one or more sensors 112 in communication with student computing system 104. For example, sensors 112 may be configured to monitor an agent, or robot, executing a process inferred from the teacher's process. Using a more specific example, sensors 112 may be configured to monitor a robot imitating the human operator's steps of placing a substrate on a pedestal of an inspection system. Sensors 112 may collect various data related to the process, such as, but not limited to, images or videos of the agent, angles related to components of the agent, motion data related to the agent, and the like.
Control system 118 may be configured to learn how to imitate the process performed by the teacher. Control system 118 may instruct, for example, a robot to perform certain actions based on data received or retrieved by sensors 112 during operation. For example, based on an indication that the robot picked up a substrate from sensors 112, control system 118 may send a signal to the robot to perform the next step in the process of placing the substrate on a pedestal of the inspection system.
As provided above, teacher computing system 102 and student computing system 104 may employ a bidirectional teacher-student training process. For example, the learning process may include a forward component and a backward component. The forward component may accomplish translation from the teacher's behavior, expressed in the teacher's native action space, into the student's behavior, expressed in the student's native action space. The backward component may provide feedback from the student to the teacher that iteratively modifies the training dataset the teacher may provide to the student, which unlocks a state subspace for the student to discover.
As part of the bidirectional teacher-student training process, student computing system 104 may employ a forward-backward-DAGGER algorithm based on the original DAGGER algorithm. The forward-backward-DAGGER algorithm improves upon the original DAGGER algorithm by addressing a disparity between the higher-dimensional action space of the teacher and the lower-dimensional action space of the student, given that the teacher can only teach or provide an expert policy in its higher-dimensional action space, which is true in most real-world situations.
In forward-backward-DAGGER, the teacher and student policies, embedded in their respective action spaces, are encoded in function approximators. An initial student policy may be selected, for example, at random from those policies having the appropriate action space.
To train student computing system 104 to learn and perform well in the environment with its constrained action space, forward-backward-DAGGER may guide aggregation of data in the student's action space. This aggregated data can be used to improve student computing system 104. Critically, feedback from the student may be reported to the teacher. This feedback may assist the teacher to guide the student to aggregate new, better data in future iterations, which may guide the student's learning.
In the context of industrial robotics control, humans are known to produce actions in a high-dimensional action space. When they attempt to teach a robot a specific task (e.g., to open a can of soda), humans perform sequences of actions that are closer to the robot's capability in reproducing those actions. Often, the robot's actions fall significantly short of reproducing the expected human behavior. This is because the human's actions are from a higher-dimensional action space that is often discretized into dimensions closer to those of the robot's action space.
To assist the robot to learn the human's original behavior, the human needs to modify their own behavior. This process requires feedback from the robot. The resulting modified human behavior, when shared with the robot, can guide the robot to behaviors that more successfully imitate the human.
In the current context, student computing system 104 may provide teacher computing system 102 with feedback, such that teacher computing system 102 can generate a policy or perform a series of steps that more closely aligns to actions or behaviors reproducible in the action space of student computing system 104.
At a high level, forward-backward-DAGGER is intended to provide value when the action spaces of the student and the teacher are distinct. For every possible trajectory sampled from the environment, a reward may be observed similar to rewards in a reinforcement learning problem. The sum of rewards of the whole trajectory can be calculated using R.
The expert policy with high-dimensional action space may be encoded using a function approximator. The student policy at a given iteration may be initialized using the expert policy and the learning observed by the student policy in the previous iteration. The student policy, when updated, may be used to sample a new trajectory that is aggregated in a data buffer. A new understanding of this data buffer may be approximated using a classifier. In some embodiments, the classifier may be representative of a scoring model/value function of the state action trajectory.
A Gaussian of the expert policy distribution may be used as the next version of the expert policy network. This step may act as an exploration aspect that may help the student. This cycle may be repeated until a student who is closer to the initial expert policy, while also securing higher rewards, is chosen as the best candidate who can perform well using the lower-dimensional action space.
The guidance provided by the expert policy at every iteration may use a Gaussian distribution implemented as a dropout function on the neural network. The Gaussian distribution can be replaced with any other function that can cause exploration of the expert policy.
In some embodiments, forward-backward-DAGGER may be implemented as a single graph containing four different neural networks: $\pi^*_1$, $\pi^*_i$, $\pi_i$, and $\hat{\pi}_i$.
An initial expert policy network, $\pi^*_1$, may be generated by collecting initial expert trajectory data from a human and encoding that initial expert trajectory data into the initial expert policy network, $\pi^*_1$. For example, in some embodiments, sensors 110 may capture initial expert trajectory data from a human performing a target process. Teacher computing system 102 may provide the initial expert trajectory data to student computing system 104. In some embodiments, rather than sensors 110 providing the initial expert trajectory data to teacher computing system 102, sensors 110 may provide the initial expert trajectory data directly to student computing system 104. The initial expert policy network, $\pi^*_1$, may be stored in a memory associated with student computing system 104. For example, to facilitate easier and faster weight transfer, the weights of initial expert policy network, $\pi^*_1$, may be held in memory. In some embodiments, the reproducibility of the expert trajectory by the expert policy may be validated before application in forward-backward-DAGGER.
A current expert policy, $\pi^*_i$, may be initialized with the initial expert policy, $\pi^*_1$. At any given iteration, current expert policy, $\pi^*_i$, may be evolved to share the weight information from both the previous iteration of current expert policy, $\pi^*_i$, and the initial expert policy, $\pi^*_1$. Such an approach may provide a collective understanding of the exploration done by the current expert policy while still revolving around the initial expert policy, $\pi^*_1$, dictated by a human teacher. Additionally, current expert policy, $\pi^*_i$, may be the key generator of exploratory trajectories that help the student learn the environment in its low-dimensional action space.
Current policy, $\pi_i$, may be representative of the student policy. For example, current policy, $\pi_i$, may be used by control system 118 for adapting and interacting with the environment. In some embodiments, current policy, $\pi_i$, may inherit information from the exploratory current expert policy, $\pi^*_i$, and the generalized current policy, $\hat{\pi}_i$. Current policy, $\pi_i$, may be designed to sample the environment and participate in the data aggregation process. However, current policy, $\pi_i$, does not generate any human or expert labels, thus facilitating the automation of the forward-backward-DAGGER algorithm.
Generalized current policy, $\hat{\pi}_i$, may be representative of the actual learner that generalizes on the aggregated dataset. The best performing network upon validation will be returned as the student who performs well in its low-dimensional state.
In operation, it is generally assumed that the teacher and student have distinct action spaces. For purposes of this discussion, the teacher policy may be represented as $\pi^*$ and the student policy may be represented as $\hat{\pi}$. Accordingly, the action space of the teacher may be represented as $\mathcal{A}(\pi^*)$; the action space of the student may be represented as $\mathcal{A}(\hat{\pi})$. To validate a given policy, a sum-of-rewards function, $R$, may be used. As input, the sum-of-rewards function, $R$, may receive a data set of trajectories induced by that policy; as output, the sum-of-rewards function, $R$, may return a real number equal to the sum-of-rewards accrued to the input trajectories. In a given policy setting, an administrator or operator may know, a priori, the minimum and maximum possible sum-of-rewards, $R_{\min}$ and $R_{\max}$.
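As a minimal sketch of the sum-of-rewards function described above, the following assumes each trajectory is a list of (state, action, reward) tuples. That tuple layout, the function names, and the min-max scaling used to normalize against the known bounds are illustrative assumptions, not details taken from the disclosure.

```python
from typing import List, Sequence, Tuple

# One step is (state, action, reward); this layout is an assumption.
Step = Tuple[Sequence[float], Sequence[float], float]
Trajectory = List[Step]


def sum_of_rewards(trajectories: List[Trajectory]) -> float:
    """R: total reward accrued over every step of every input trajectory."""
    return sum(step[2] for traj in trajectories for step in traj)


def normalize_reward(r: float, r_min: float, r_max: float) -> float:
    """Map r from [r_min, r_max] onto [0, 1] (hypothetical min-max scaling)."""
    if r_max <= r_min:
        raise ValueError("r_max must exceed r_min")
    # Clamp so that values outside the a priori bounds stay in [0, 1].
    return min(1.0, max(0.0, (r - r_min) / (r_max - r_min)))
```

A normalized value of this kind could serve as a score in [0, 1] for comparing iterations, although the disclosure does not state which normalization is used.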
The process may begin at step 402 with the initial teacher policy, $\pi^*$. The initial teacher policy may include two mixture parameters, $\alpha, \gamma \in \mathbb{R} \cap [0,1]$. The $\alpha$ parameter may control how the current teacher policy, $\pi^*_i$, and current student policy, $\hat{\pi}_i$, are combined into the current hybrid policy, $\pi_i$. The $\alpha$ parameter may be used as the base in an exponential decay function, $\beta_i = \alpha^{i-1}$, that may specify the precipitously decreasing significance of the teacher policy, $\pi^*_i$, in the hybrid policy, $\pi_i$, relative to the student policy, $\hat{\pi}_i$, from one iteration to the next. The $\gamma$ parameter may linearly control how the respective initial teacher policy, $\pi^*_1$, and current teacher policy, $\pi^*_i$, may be combined into the next teacher policy, $\pi^*_{i+1}$.
At step 404, control system 118 may initialize the trajectory data set, $D$, to empty. Control system 118 may set the initial teacher policy, $\pi^*_1$, and may select the initial student policy, $\hat{\pi}_1$, from the set of student policies, $\hat{\Pi}$.
At step 406, control system 118 may learn how to perform a task as conveyed by the initial teacher policy, $\pi^*$. For example, control system 118 may iterate $N$ times over the following, indexing on $i$, where $i$ may have the initial value of 1. Step 406 may include steps 408-422.
At step 408, control system 118 may compute the current exponential mixture coefficient, $\beta_i$, where $\beta_i = \alpha^{i-1}$. At step 410, control system 118 may then create the current hybrid teacher-student policy, $\pi_i$, from an exponentially weighted mixture of the current teacher policy, $\beta_i \pi^*_i$, and the current student policy, $(1-\beta_i)\hat{\pi}_i$.
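Steps 408 and 410 can be sketched as follows. The coefficient follows the stated decay, with beta_i equal to alpha raised to the power i − 1. How the two policies are mixed is not spelled out in the text, so this sketch assumes the stochastic mixture used in the original DAGGER algorithm, in which the teacher is followed with probability beta_i at each step; the `Policy` type and function names are hypothetical.

```python
import random
from typing import Callable, Sequence

# A policy maps a state to an action; this signature is an assumption.
Policy = Callable[[Sequence[float]], Sequence[float]]


def mixture_coefficient(alpha: float, i: int) -> float:
    """beta_i = alpha ** (i - 1), for an iteration index i starting at 1."""
    if not 0.0 <= alpha <= 1.0 or i < 1:
        raise ValueError("need 0 <= alpha <= 1 and i >= 1")
    return alpha ** (i - 1)


def hybrid_policy(teacher: Policy, student: Policy, beta: float,
                  rng: random.Random) -> Policy:
    """Follow the teacher with probability beta, otherwise the student."""
    def act(state: Sequence[float]) -> Sequence[float]:
        return teacher(state) if rng.random() < beta else student(state)
    return act
```

With alpha below 1, beta_1 equals 1 and later coefficients shrink geometrically, so early trajectories are driven by the teacher and later ones increasingly by the student.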
At step 412, control system 118 may create a set of $T$-step trajectories induced by the current hybrid policy, $\pi_i$.
At step 418, control system 118 may create the next student policy, $\hat{\pi}_{i+1}$, from the classifier trained on $D$. Let $\rho \in \mathbb{R} \cap [0,1]$ be the normalized sum-of-rewards based on $D_i$, which may be treated as a probability signifying how successfully the current hybrid policy, $\pi_i$, imitated the teacher's original policy, $\pi^*_1$. Let $\phi$ be a sigmoid function of $\rho$ that smoothly decreases from 1 to 0, so that it is well behaved as $\rho \to R_{\min}$.
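The exact sigmoid expression is not reproduced in the text above, so the following is only an illustrative stand-in: a decreasing logistic curve over the normalized sum-of-rewards. The function name and the `steepness` and `midpoint` parameters are hypothetical and are not taken from the disclosure.

```python
import math


def exploration_scale(rho: float, steepness: float = 10.0,
                      midpoint: float = 0.5) -> float:
    """Decreasing logistic: close to 1 for low rho, close to 0 for high rho.

    This is an assumed form; the disclosure's own sigmoid may differ.
    """
    return 1.0 / (1.0 + math.exp(steepness * (rho - midpoint)))
```

Under this assumed form, a poorly performing hybrid policy (low score) yields a value near 1 and a well performing one yields a value near 0, which matches the described decrease from 1 to 0.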
At step 420, control system 118 may create the current hybrid original-teacher-current-teacher policy, $\tilde{\pi}^*$, from a linearly weighted mixture of the initial teacher policy, $\gamma \pi^*_1$, and the current teacher policy, $(1-\gamma)\pi^*_i$. This formulation may allow one to specify the persistence of the original teacher policy in the evolution of the teacher policy.
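The linear mixture of step 420 can be sketched under the assumption that each teacher policy is represented by a flat list of network weights of equal length. That representation and the function name are illustrative choices, since the text only states that the policies share weight information.

```python
from typing import List


def blend_teacher_weights(initial: List[float], current: List[float],
                          gamma: float) -> List[float]:
    """Return gamma * initial + (1 - gamma) * current, element by element."""
    if len(initial) != len(current):
        raise ValueError("weight vectors must have the same length")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return [gamma * w1 + (1.0 - gamma) * wi
            for w1, wi in zip(initial, current)]
```

Setting gamma to 1 pins the blend to the original teacher, while setting it to 0 lets the teacher policy drift freely from one iteration to the next.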
At step 422, control system 118 may create the next teacher policy, $\pi^*_{i+1}$, by randomly selecting from the space around the current hybrid original-teacher-current-teacher policy, $\tilde{\pi}^*$, on the basis of how well the current hybrid teacher-student policy, $\pi_i$, performed, quantified as $\rho$ and $\phi$, described above, and characterized by a Gaussian-sampled multiplier $\mathcal{N}(1, \phi)$. In some embodiments, the Gaussian-sampled multiplier may be representative of the addition of injected noise into the teacher policy. For example, at every iteration of $i$, the teacher policy, $\pi^*_i$, can be injected with Gaussian noise. Data sampling using $\pi_i = \beta_i \pi^*_i + (1-\beta_i)\hat{\pi}_i$ is expected to overcome issues related to instability. The Gaussian noise injected into the teacher can be expected to cause exploratory effects on the student policy, since data aggregation may be done using $\pi_i = \beta_i \pi^*_i + (1-\beta_i)\hat{\pi}_i$. These exploratory effects can be expected to cause the student policy, constrained by its own action space dimensionality, to visit states whose probability $P(s \in S \mid \hat{\pi}_i)$ is low.
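A minimal sketch of the Gaussian-sampled multiplier follows. It assumes the teacher policy is a flat list of weights, that an independent multiplier is drawn for each weight, and that the second Gaussian parameter is used as a standard deviation. All three points are assumptions made for illustration.

```python
import random
from typing import List


def perturb_teacher(weights: List[float], phi: float,
                    rng: random.Random) -> List[float]:
    """Scale each weight by an independent draw from a Gaussian centred at 1.

    phi is treated as the standard deviation here, which is an assumption.
    """
    if phi < 0.0:
        raise ValueError("phi must be non-negative")
    return [w * rng.gauss(1.0, phi) for w in weights]
```

When the spread is 0 the weights are returned unchanged, so a hybrid policy that already imitates the teacher well would trigger little further exploration, while a larger spread widens the search around the blended teacher.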
Once the iteration on $i$ from 1 to $N$ is complete, at step 424, control system 118 may compute the index value, $i^*$, whose trajectory data set, $D_{i^*}$, gives the maximum sum-of-rewards, $R(D_{i^*})$. At step 426, control system 118 may return the teacher and student policies whose index is $i^*$: $\{\pi^*_{i^*}, \hat{\pi}_{i^*}\}$.
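The selection in steps 424 and 426 amounts to an argmax over the per-iteration sum-of-rewards. The sketch below assumes those values have already been collected into a list ordered by iteration; the function name and the tie-breaking rule (earliest iteration wins) are illustrative choices.

```python
from typing import Sequence


def best_iteration(rewards_per_iteration: Sequence[float]) -> int:
    """Return the 1-based index whose data set earned the largest reward sum."""
    if not rewards_per_iteration:
        raise ValueError("at least one iteration is required")
    # max() returns the first maximal element, so ties go to the earliest index.
    best = max(range(len(rewards_per_iteration)),
               key=rewards_per_iteration.__getitem__)
    return best + 1
```

The returned index can then be used to look up the stored teacher and student policies for that iteration.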
As shown, at step 602, student and teacher policies may be initialized for a starting state. In some embodiments, the student and teacher policies are assumed to operate over distinct action spaces.
At step 604, the student may iterate over the student's environment a set number of times. For each iteration, the student may gather state-action-reward tuples for posterity.
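The data gathering of step 604 can be sketched as a simple rollout loop. The environment interface is an assumption: it is modeled here as a hypothetical `step` callable that maps a state and an action to the next state and a reward, with scalar states and actions for brevity.

```python
from typing import Callable, List, Tuple

State = float
Action = float
Transition = Tuple[State, Action, float]  # (state, action, reward)


def rollout(policy: Callable[[State], Action],
            step: Callable[[State, Action], Tuple[State, float]],
            start: State, horizon: int) -> List[Transition]:
    """Run the policy for a fixed number of steps and record each transition."""
    state = start
    transitions: List[Transition] = []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = step(state, action)
        transitions.append((state, action, reward))
        state = next_state
    return transitions
```

The recorded tuples can be kept across iterations so that later optimization steps have access to every state the student visited.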
At step 606, the student policy may be optimized over the maximal reward.
At step 608, a trajectory is created for both the student and teacher for the environment.
At step 610, for all actions of the trajectory that follow an erroneous action from the student compared to the teacher, the weights of the student policy may be optimized to minimize the deviation from the teacher trajectory.
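One way to read step 610 is as a loss computed only from the point where the student first departs from the teacher. The sketch below assumes both trajectories have been expressed as scalar actions in a common space, uses a mean squared deviation, and includes the erroneous step itself in the loss. These are illustrative assumptions rather than details given in the text.

```python
from typing import Optional, Sequence


def first_error_index(student: Sequence[float], teacher: Sequence[float],
                      tol: float = 1e-6) -> Optional[int]:
    """Index of the first action that differs by more than tol, else None."""
    for k, (a_s, a_t) in enumerate(zip(student, teacher)):
        if abs(a_s - a_t) > tol:
            return k
    return None


def deviation_after_error(student: Sequence[float], teacher: Sequence[float],
                          tol: float = 1e-6) -> float:
    """Mean squared deviation from the first erroneous action onward."""
    k = first_error_index(student, teacher, tol)
    if k is None:
        return 0.0
    # zip() pairs actions up to the shorter trajectory, matching the scan above.
    pairs = list(zip(student, teacher))[k:]
    return sum((a_s - a_t) ** 2 for a_s, a_t in pairs) / len(pairs)
```

A value of this kind could be minimized with respect to the student policy weights; the optimizer itself is outside the scope of this sketch.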
At step 612, the system may iterate the next teacher policy. Such iteration may continue until a best or optimal teacher-student policy is returned.
To enable user interaction with the computing system 700, an input device 745 may represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, and so forth. An output device 735 (e.g., display) may also be one or more of a number of output mechanisms known to those of skill in the art. In some instances, multimodal systems may enable a user to provide multiple types of input to communicate with computing system 700. Communications interface 740 may generally govern and manage the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
Storage device 730 may be a non-volatile memory and may be a hard disk or other types of computer readable media which may store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random access memories (RAMs) 725, read only memory (ROM) 720, and hybrids thereof.
Storage device 730 may include services 732, 734, and 736 for controlling the processor 710. Other hardware or software modules are contemplated. Storage device 730 may be connected to system bus 705. In one aspect, a hardware module that performs a particular function may include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as processor 710, bus 705, output device 735, and so forth, to carry out the function.
Chipset 760 may also interface with one or more communication interfaces 790 that may have different physical interfaces. Such communication interfaces may include interfaces for wired and wireless local area networks, for broadband wireless networks, as well as personal area networks. Some applications of the methods for generating, displaying, and using the GUI disclosed herein may include receiving ordered datasets over the physical interface, or the datasets may be generated by the machine itself by processor 755 analyzing data stored in storage device 770 or RAM 775. Further, the machine may receive inputs from a user through user interface components 785 and execute appropriate functions, such as browsing functions, by interpreting these inputs using processor 755.
It may be appreciated that example systems 700 and 750 may have more than one processor 710 or be part of a group or cluster of computing devices networked together to provide greater processing capability.
While the foregoing is directed to embodiments described herein, other and further embodiments may be devised without departing from the basic scope thereof. For example, aspects of the present disclosure may be implemented in hardware or software or a combination of hardware and software. One embodiment described herein may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory (ROM) devices within a computer, such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips, or any type of solid-state non-volatile memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid state random-access memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the disclosed embodiments, are embodiments of the present disclosure.
It will be appreciated by those skilled in the art that the preceding examples are exemplary and not limiting. It is intended that all permutations, enhancements, equivalents, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure. It is therefore intended that the following appended claims include all such modifications, permutations, and equivalents as fall within the true spirit and scope of these teachings.
This application claims priority to U.S. Provisional Application Ser. No. 63/153,811, filed Feb. 25, 2021, which is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20220269254 A1 | Aug 2022 | US |
Number | Date | Country | |
---|---|---|---|
63153811 | Feb 2021 | US |