Lee, Junkyu et al., “AI Planning Annotation in Reinforcement Learning: Options and Beyond,” Aug. 5, 2021, available at:
https://people.csail.mit.edu/tommi/papers/YCZJ_EMNLP2019.pdf
The present disclosure generally relates to sequential decision-making problems, and more particularly, to sequential decision-making problems utilizing Artificial Intelligence (AI) planning and reinforcement learning (RL).
AI planning and RL are two methods that can be used to solve sequential decision-making problems. However, AI planning and RL take fundamentally different approaches to solving them.
In one embodiment, a computer-implemented method of integrating an Artificial Intelligence (AI) planner and a reinforcement learning (RL) agent through AI planning annotation in RL (PaRL) includes identifying an RL problem. A description is received of a Markov decision process (MDP) having a plurality of states in an RL environment and is used to generate an RL task to solve the RL problem. An AI planning model described in a planning language is received, and mapping of state spaces from the MDP states in the RL environment to AI planning states of the AI planning model is performed. The RL task is annotated with an AI planning task from the mapping to generate a PaRL task.
In an embodiment, the identified RL problem is solved using the generated PaRL task.
In an embodiment, the PaRL task is formulated by an options framework for the MDP.
In an embodiment, one or more sets of AI plans are generated in the options framework. Options are selected from the options framework for training the RL agent by ranking the options with scores, and the selected options are sent to the RL agent.
In an embodiment, the selecting of options from the options framework is performed offline or online.
In an embodiment, the performing of a rollout option sequence with online planning includes generating a plan given a trajectory, ranking options according to a scoring function, and sending the option with the highest score to the RL agent.
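By way of a non-limiting illustration, the Python sketch below shows one way such a ranking-and-selection step could be arranged; the scoring function and the candidate options used here are invented examples and are not part of the disclosed method.

```python
# Illustrative sketch only: rank candidate options with a scoring function and
# return the highest-scoring one, which would then be sent to the RL agent.
from typing import Callable, Sequence


def select_option(options: Sequence[str],
                  score_fn: Callable[[str], float]) -> str:
    """Return the option with the highest score under score_fn."""
    return max(options, key=score_fn)


# Example usage with a made-up scoring function that prefers operators
# appearing earlier in the generated plan.
plan = ["pick-up", "move", "drop"]
print(select_option(plan, score_fn=lambda op: -plan.index(op)))  # -> pick-up
```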
In an embodiment, the sending of the options to the RL agent includes guiding a sampling process by a PaRL planner to sample the options.
In an embodiment, the annotating of the RL task with the AI planning task from the mapping to generate the PaRL task includes at least one mapping selected from the group of abstraction mapping in AI planning, heuristic mapping between state spaces, and rule-based mapping.
In an embodiment, the planning language in which the AI planning model is described is selected from the group of a Planning Domain Definition Language (PDDL), a Stanford Research Institute Problem Solver (STRIPS), SAS+, and an Action Description Language (ADL).
In an embodiment, a computer-implemented method includes producing a policy function and a probability distribution over RL environment actions per RL environment state. The policy function is produced by defining options for the RL environment based on the operators in the planning task. An initiation set of an option is defined by a set of states of the RL environment that are mapped by L to states satisfying the precondition of an action operator, and the termination set of the option is defined by the set of states of the RL environment that are mapped by L to states satisfying the effects of the action operator.
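As a minimal sketch of this construction, and assuming illustrative Operator and Option structures that are not defined by the disclosure, an option may be derived from a planning operator and the mapping L as follows.

```python
# Illustrative sketch: an option's initiation set contains the RL states that L
# maps to planning states satisfying the operator's precondition, and its
# termination set contains the RL states that L maps to states satisfying the
# operator's effects.
from dataclasses import dataclass
from typing import Callable, FrozenSet


PlanningState = FrozenSet[str]   # a planning state as a set of facts (assumption)


@dataclass
class Operator:
    name: str
    precondition: FrozenSet[str]
    effects: FrozenSet[str]


@dataclass
class Option:
    name: str
    initiation: Callable[[object], bool]   # True if the option may start in the RL state
    termination: Callable[[object], bool]  # True if the option terminates in the RL state


def option_from_operator(op: Operator,
                         L: Callable[[object], PlanningState]) -> Option:
    """Induce an option from a planning operator via the state-space mapping L."""
    return Option(
        name=op.name,
        initiation=lambda s: op.precondition <= L(s),
        termination=lambda s: op.effects <= L(s),
    )
```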
In an embodiment, the computer-implemented method includes generating a sequence of options using an AI planner from a state of the RL environment. The initial state of the planning task is obtained by mapping with L from the RL environment state. Planning algorithms are applied to generate a sequence of action operators that lead from an initial planning state to a planning goal.
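Continuing the illustration above, and assuming a generic planner callable that is not specified by the disclosure, a plan computed from the mapped initial state yields a sequence of action operators, each of which induces an option.

```python
# Illustrative sketch: obtain the initial planning state by applying L to the
# current RL environment state, then run any classical planning algorithm to
# generate a sequence of action operators leading to the planning goal.
from typing import Callable, List


def plan_option_sequence(rl_state,
                         L: Callable,
                         goal,
                         planner: Callable) -> List[str]:
    """`planner(initial_planning_state, goal)` is assumed to return a list of
    operator names; each operator induces an option as sketched above."""
    initial_planning_state = L(rl_state)
    return list(planner(initial_planning_state, goal))
```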
In an embodiment, the producing of the policy function and the probability distribution over options per the RL environment state is performed by using a reinforcement learning algorithm.
In an embodiment, the policy function includes a set of option policy functions.
In one embodiment, a computing device configured to integrate an Artificial Intelligence (AI) planner and a reinforcement learning (RL) agent through AI planning annotation in RL (PaRL), the device includes a processor, and a memory coupled to the processor. The memory stores instructions to cause the processor to perform acts including identifying an RL problem; receiving a description of a Markov decision process (MDP) having a plurality of states in an RL environment to generate an RL task to solve the RL problem; receiving an AI planning model described in a planning language; mapping state spaces from the MDP states in the RL environment to AI planning states of the AI planning model; and annotating the RL task with an AI planning task from the mapping to generate a PaRL task.
In one embodiment, a non-transitory computer readable storage medium tangibly embodying a computer readable program code having computer readable instructions that, when executed, causes a computer device to carry out a method of integrating an Artificial Intelligence (AI) planner and a reinforcement learning (RL) agent through AI planning annotation in RL (PaRL). The method includes identifying an RL problem. A description is received of a Markov decision process (MDP) having a plurality of states in an RL environment to generate an RL task to solve the RL problem. An AI planning model described in a planning language is received. State spaces are mapped from the MDP states in the RL environment to AI planning states of the AI planning model, and the RL task is annotated with an AI planning task from the mapping to generate a PaRL task.
These and other features will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The drawings are of illustrative embodiments. They do not illustrate all embodiments. Other embodiments may be used in addition to or instead. Details that may be apparent or unnecessary may be omitted to save space or for more effective illustration. Some embodiments may be practiced with additional components or steps and/or without all the components or steps that are illustrated. When the same numeral appears in different drawings, it refers to the same or like components or steps.
In the following detailed description, numerous specific details are set forth by way of examples to provide a thorough understanding of the relevant teachings. However, it should be understood that the present teachings may be practiced without such details. In other instances, well-known methods, procedures, components, and/or circuitry have been described at a relatively high level, without detail, to avoid unnecessarily obscuring aspects of the present teachings.
As used herein, the term “RL problem” generally refers to a request for a best course of action to be taken given an RL agent's observation of the environment. The best course of action to an RL problem is to find/select a policy that maps a given observation to one or more of the actions to be performed. The course of action is undertaken to maximize a reward, which may be a cumulative reward. An RL task is often defined in terms of a Markov Decision Process (MDP) in the RL environment.
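For illustration only, the toy example below captures this notion of an RL task and a policy that maximizes a cumulative reward; the states, actions, and rewards are invented and carry no significance beyond the example.

```python
# Toy MDP: a policy maps each observed state to an action, and the course of
# action is undertaken to maximize the cumulative reward.
states = ["s0", "s1", "goal"]
actions = ["left", "right"]
transitions = {("s0", "right"): "s1", ("s1", "right"): "goal",
               ("s0", "left"): "s0", ("s1", "left"): "s0"}
reward = {("s1", "right"): 1.0}     # sparse reward for reaching the goal


def policy(state: str) -> str:
    """A deterministic policy mapping an observation to an action."""
    return "right"


state, cumulative_reward = "s0", 0.0
while state != "goal":
    action = policy(state)
    cumulative_reward += reward.get((state, action), 0.0)
    state = transitions[(state, action)]
print(cumulative_reward)   # -> 1.0
```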
It is to be understood that any reference to Q-Learning does not limit the appended claims, and other reinforcement learning algorithms and policy optimization algorithms are applicable.
AI planning tasks are used to define high-level decision problems and reformulate an original “flat” RL problem via an options framework. Through the use of AI planning operators, the RL options are defined. The RL problem is annotated with an AI planning task(s). The RL problem can be reformulated as a Hierarchical RL problem.
With regard to integrating AI planners and RL agents through AI planning annotation in RL, it is to be understood that AI planning provides solutions to shortest path problems in large-scale state transition systems concisely declared by symbolic languages. As a result, AI planning is capable of quickly solving large tasks. For example, AI planning uses operator models and also provides efficient plan generation. On the other hand, RL does not require an operator model and learns a policy to guide an agent to high reward states. However, RL uses a large number of training examples to learn a policy. RL primarily addresses discounted Markov Decision Process (MDP) problems in a model-free setting. RL can be combined with deep neural networks (Deep RL) to solve problems with large-scale unstructured state spaces. While Deep RL (DRL) solves problems with large-scale unstructured state spaces, a model-free DRL operation can be sample inefficient when the reward is sparse, or in a case where the underlying model has dead ends or zero-length cycles.
As discussed herein, embodiments of the present disclosure teach a computer-implemented method and system of integrating AI planners and RL agents through AI planning annotation in RL (PaRL). PaRL includes an integrated AI planning and RL architecture to perform a Hierarchical Reinforcement Learning (HRL) approach that formulates an RL problem with a complex hierarchy of sequential decision-making problems. AI planning tasks are used to define high-level decision problems and reformulate an original “flat” RL problem via an options framework. PaRL links the state abstraction in AI planning and the temporal abstraction in RL. In a case where a state space mapping assumption is common to all options, the RL options can be defined on planning operators. The RL options can be defined through the use of planning tasks, the definition of a mapping between action spaces, and the use of reinforcement learning algorithms. In an embodiment, the frameworks can form an environment for problem-solving, for example, using Python, TensorFlow, etc. However, the appended claims of the disclosure are not limited to the aforementioned environments.
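One possible, non-limiting way to represent such a PaRL task in code is as a bundle of the RL task, the AI planning task, and the mapping L between their state spaces; the field names below are illustrative assumptions rather than defined elements of the disclosure.

```python
# Illustrative sketch: a PaRL task as an RL task annotated with an AI planning
# task through the state-space mapping L.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PaRLTask:
    mdp: Any                      # the "flat" RL task (environment, reward, discount)
    planning_task: Any            # variables, operators over the variables, and a goal
    L: Callable[[Any], Any]       # maps MDP states of the RL environment to planning states
```

Each operator of the planning task then induces one RL option via L, as sketched earlier, which reformulates the flat RL problem as a hierarchical one.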
In an RL framework, a processor is configured to solve an RL problem 105. A description of a Markov Decision Process (MDP) 110 in the RL environment is received/retrieved. A description of a model in one of the planning languages is received. The planning languages include, but are not limited to, a Planning Domain Definition Language (PDDL), a Stanford Research Institute Problem Solver (STRIPS), SAS+, and an Action Description Language (ADL). A mapping L is performed to map from the MDP states of the RL environment to states of the planning model. At 115, the RL task is annotated using the AI planning task.
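A mapping L of this kind might, for instance, abstract a concrete RL observation into symbolic planning facts. The grid-world observation and facts below are hypothetical and serve only to illustrate the direction of the mapping.

```python
# Purely illustrative mapping L from a made-up grid-world observation to a set
# of planning facts; real mappings may be abstraction-, heuristic-, or rule-based.
from typing import Dict, FrozenSet


def L(rl_state: Dict[str, int]) -> FrozenSet[str]:
    """Abstract a concrete RL observation into symbolic planning facts."""
    facts = set()
    if (rl_state["x"], rl_state["y"]) == (rl_state["goal_x"], rl_state["goal_y"]):
        facts.add("(at-goal)")
    if rl_state["has_key"]:
        facts.add("(holding key)")
    return frozenset(facts)


print(L({"x": 2, "y": 3, "goal_x": 2, "goal_y": 3, "has_key": 1}))
# e.g. frozenset({'(at-goal)', '(holding key)'})
```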
The embodiments of the computer-implemented method and system of the present disclosure provide for an improvement in the field of solving sequential decision-making problems (SDMP), in which the results have increased accuracy over the use of either AI planning or reinforcement learning alone. In addition, there is an improvement in computer operations, as the computer-implemented method and system according to the present disclosure reduce the amount of processing power used to achieve the results while also reducing storage usage. The improved efficiency of operation provides for a reduction in processing to achieve solutions to SDMP, also resulting in savings of storage and power usage.
Additional advantages of the present architecture are disclosed herein.
With continued reference to
As shown in box 415, an option is defined for each o ∈ O, where:
There is shown a repeat rollout-train until iteration limit 1001, and repeat rollout samples from a current policy 1005, which includes a repeat until rollout limit 1015. When the iteration loop begins at 1015, there is no option selected, leading to a process of generating a plan from a planning task and choosing an option from the action operator in the plan. There is shown an RL environment 1006, in which at 1007 a reward and a state are received. At 1008, the RL state is mapped to a planning state. At 1009, a rollout option includes determining if an option is selected. If no option is selected, a planning task is defined from the current planning state, followed by generating a plan and selecting an option from the first operator in the plan or option policy.
At 1010, an action is sampled from the current option policy and stored in a buffer. At 1010, samples are generated using the selected option policy, leading to the next RL state. In the following iteration, there is an option selected, and if so, there is a check to determine whether the current option needs to be terminated by checking the termination condition, leading to selecting a new option (Yes) or continuing with the current option (No). These actions repeat in 1005 until a rollout limit is reached. The rollout limit is application dependent and reflects the longest trajectory that the algorithm seeks to generate from the environment. Typically, the limit is larger than the length of the desired trajectory connecting the initial state to a desired goal state. A non-exhaustive example of a limit can be 1000 steps for a certain application. Finally, the option policy functions are trained, as is the SMDP policy function.
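The rollout-train loop described above can be condensed into the following hedged Python sketch; the environment interface, the planner, the option-selection helper, and the training call are placeholders introduced only for illustration.

```python
# Condensed, illustrative sketch of the rollout-train loop described above.
# `env`, `plan`, `choose_option`, and `train` are placeholder callables.
def rollout_train(env, L, plan, choose_option, train,
                  iteration_limit=100, rollout_limit=1000):
    buffer = []
    for _ in range(iteration_limit):              # repeat rollout-train (1001)
        state, option = env.reset(), None
        for _ in range(rollout_limit):            # repeat until rollout limit (1015)
            planning_state = L(state)             # map RL state to planning state (1008)
            if option is None or option.termination(state):
                operators = plan(planning_state)  # plan from the current planning state
                option = choose_option(operators) # e.g., first operator in the plan
            action = option.policy(state)         # sample action from the option policy (1010)
            next_state, reward, done = env.step(action)
            buffer.append((option, state, action, reward, next_state))
            state = next_state
            if done:
                break
        train(buffer)                             # train option policies and the SMDP policy
```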
wherein H is an indicator function, and c1, c2 are negative costs. The variables are the variables in the planning task, where a planning task is defined as a set of variables, a set of operators over the variables, and the goal in
A sample trajectory is then modified (current option, state, action, reward, intrinsic reward, state’). Planning annotation with an intrinsic reward increases the sample efficiency of PaRL as compared with a flat RL case.
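One plausible way to attach such an intrinsic reward to a sample is sketched below; the particular indicator terms and the constants c1 and c2 are assumptions made only for illustration and are not the formula of the disclosure.

```python
# Hypothetical sketch: an intrinsic reward built from indicator (H) terms
# weighted by negative costs c1 and c2 (the terms and values are assumptions).
def intrinsic_reward(option, next_state, c1=-0.1, c2=-1.0):
    H_not_terminated = 0.0 if option.termination(next_state) else 1.0
    H_outside_option = 0.0 if option.initiation(next_state) else 1.0
    return c1 * H_not_terminated + c2 * H_outside_option


def annotate_sample(sample, option):
    """Turn (state, action, reward, next state) into
    (current option, state, action, reward, intrinsic reward, next state)."""
    state, action, reward, next_state = sample
    return (option, state, action, reward,
            intrinsic_reward(option, next_state), next_state)
```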
With the foregoing overview of the example architecture, it may be helpful now to consider a high-level discussion of an example process. To that end,
At operation 1305, an RL problem is identified. For example, the RL problem may have been formulated in terms of a defined environment (agents, states, actions, and rewards). Data collection, feature engineering, and modeling are part of the formulation.
At operation 1310, a description is received of a Markov Decision Process (MDP) in an RL environment. The MDP description is used to generate a task to solve the RL problem.
At operation 1315, an AI planning model is received. The AI planning model is described in a planning language. Non-exhaustive examples of planning languages for the model are PDDL, STRIPS, SAS+, and ADL.
At operation 1320, state spaces are mapped from the MDP states in the RL environment to AI planning states of the received AI planning model. The mapping may include abstraction mapping, heuristic mapping between state spaces, and rule-based mapping of the MDP states to the AI planning states.
At operation 1325, the RL task is annotated with an AI planning task from the mapping to generate a PaRL task. The tasks are annotated by symbolic planning. The identified RL problem can then be solved using the PaRL task.
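Read together, operations 1305 through 1325 can be illustrated by the following high-level composition; every function named below is a placeholder standing in for the corresponding operation rather than a required implementation.

```python
# Placeholder composition of operations 1305-1325 (illustrative only).
def build_parl_task(identify_rl_problem, receive_mdp, receive_planning_model,
                    build_mapping, annotate):
    rl_problem = identify_rl_problem()            # 1305: identify the RL problem
    mdp = receive_mdp(rl_problem)                 # 1310: MDP description -> RL task
    planning_model = receive_planning_model()     # 1315: e.g., PDDL/STRIPS/SAS+/ADL
    L = build_mapping(mdp, planning_model)        # 1320: MDP states -> planning states
    return annotate(mdp, planning_model, L)       # 1325: the resulting PaRL task
```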
The computer platform 1400 may include a central processing unit (CPU) 1404, a hard disk drive (HDD) 1406, random access memory (RAM) and/or read-only memory (ROM) 1408, a keyboard 1410, a mouse 1412, a display 1414, and a communication interface 1416, which are connected to a system bus 1402. The HDD 1406 can include data stores.
In one embodiment, the HDD 1406 has capabilities that include storing a program that can execute various processes, such as machine learning, predictive modeling, classification, and updating model parameters. The ML model generation module 1440 is configured to generate a machine learning model based on at least one of the generated candidate machine learning pipelines.
With continued reference to
As discussed above, functions relating to the teachings herein may include a cloud. It is to be understood that although this disclosure includes a detailed description of cloud computing as discussed herein below, the implementation of the teachings recited herein is not limited to a cloud computing environment. Rather, embodiments of the present disclosure are capable of being implemented in conjunction with any other type of computing environment now known or later developed.
Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.
On-demand self-service: a cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with the service's provider.
Broad network access: capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).
Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
Platform as a Service (PaaS): the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including networks, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).
Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.
Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on-premises or off-premises.
Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).
A cloud computing environment is service-oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure that includes a network of interconnected nodes.
Referring now to
Referring now to
Hardware and software layer 1660 includes hardware and software components. Examples of hardware components include: mainframes 1661; RISC (Reduced Instruction Set Computer) architecture-based servers 1662; servers 1663; blade servers 1664; storage devices 1665; and networks and networking components 1666. In some embodiments, software components include network application server software 1667 and database software 1668.
Virtualization layer 1670 provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers 1671; virtual storage 1672; virtual networks 1673, including virtual private networks; virtual applications and operating systems 1674; and virtual clients 1675.
In one example, management layer 1680 may provide the functions described below. Resource provisioning 1681 provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and Pricing 1682 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may include application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal 1683 provides access to the cloud computing environment for consumers and system administrators. Service level management 1684 provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment 1685 provide pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.
Workloads layer 1690 provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include: mapping and navigation 1691; software development and lifecycle management 1692; virtual classroom education delivery 1693; data analytics processing 1694; transaction processing 1695; and a PaRL module 1696 configured to integrate the AI planning and RL architecture to perform a Hierarchical Reinforcement Learning (HRL) approach that formulates an RL problem with a complex hierarchy of sequential decision-making problems. AI planning tasks are used to define high-level decision problems and reformulate an original “flat” RL problem via an options framework. PaRL links the state abstraction in AI planning and the temporal abstraction in RL, as discussed herein above.
The descriptions of the various embodiments of the present teachings have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
While the foregoing has described what are considered to be the best mode and/or other examples, it is understood that various modifications may be made therein and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications, and variations that fall within the true scope of the present teachings.
The components, operations, steps, features, objects, benefits, and advantages that have been discussed herein are merely illustrative. None of them, nor the discussions relating to them, are intended to limit the scope of protection. While various advantages have been discussed herein, it will be understood that not all embodiments necessarily include all advantages. Unless otherwise stated, all measurements, values, ratings, positions, magnitudes, sizes, and other specifications that are set forth in this specification, including in the claims that follow, are approximate, not exact. They are intended to have a reasonable range that is consistent with the functions to which they relate and with what is customary in the art to which they pertain.
Numerous other embodiments are also contemplated. These include embodiments that have fewer, additional, and/or different components, steps, features, objects, benefits and advantages. These also include embodiments in which the components and/or steps are arranged and/or ordered differently.
The flowchart and diagrams in the figures herein illustrate the architecture, functionality, and operation of possible implementations according to various embodiments of the present disclosure.
While the foregoing has been described in conjunction with exemplary embodiments, it is understood that the term “exemplary” is merely meant as an example, rather than the best or optimal. Except as stated immediately above, nothing that has been stated or illustrated is intended or should be interpreted to cause a dedication of any component, step, feature, object, benefit, advantage, or equivalent to the public, regardless of whether it is or is not recited in the claims.
It will be understood that the terms and expressions used herein have the ordinary meaning as is accorded to such terms and expressions with respect to their corresponding respective areas of inquiry and study except where specific meanings have otherwise been set forth herein. Relational terms such as first and second and the like may be used solely to distinguish one entity or action from another without necessarily requiring or implying any such actual relationship or order between such entities or actions. The terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “a” or “an” does not, without further constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The Abstract of the Disclosure is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in various embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments have more features than are expressly recited in each claim. Rather, as the following claims reflect, the inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.