The present disclosure relates to methods for planning in uncertain conditions, and more particularly to solving Delayed observation Partially Observable Markov Decision Processes (D-POMDPs).
Recently, there has been an increase in interest in autonomous agents deployed in domains ranging from automated trading and traffic control to disaster rescue and space exploration. Delayed observation reasoning is particularly relevant in providing real-time decisions based on traffic congestion or incident information, in making decisions on new products before receiving the market response to a new product, and the like. Similarly, in therapy planning, a patient's treatment may have to continue even if the patient's response to a medicine is not observed immediately. Delays in receiving such information can be due to data fusion, computation, transmission and physical limitations of the underlying process.
Attempts to solve problems having delayed observations and delayed reward feedback have been designed to provide a sufficient statistic and theoretical guarantees on the solution quality for static and randomized delays. Although these theoretical properties are important, an approach based on maintaining a sufficient statistic is not scalable.
According to an exemplary embodiment of the present invention, a method for determining a policy that considers observations delayed at runtime is disclosed. The method includes constructing a model of a stochastic decision process that receives delayed observations at run time, wherein the stochastic decision process is executed by an agent, finding an agent policy according to a measure of an expected total reward of a plurality of agent actions within the stochastic decision process over a given time horizon, and bounding an error of the agent policy according to an observation delay of the received delayed observations.
Preferred embodiments of the present invention will be described below in more detail, with reference to the accompanying drawings:
According to an exemplary embodiment of the present invention, methods are described for a parameterized approximation for solving Delayed observation Partially Observable Markov Decision Processes (D-POMDPs) with a desired accuracy. A policy execution technique is described that adjusts an agent policy corresponding to delayed observations at run-time for improved performance.
Exemplary embodiments of the present invention are applicable to various fields, for example, food safety testing (e.g., testing for pathogens) and communications, and more generally to Markov decision processes with delayed state observations. In the field of food safety testing, sequential testing can be inaccurate, test results arrive with delays, and a testing period is finite. In the field of communications, within dynamic environments, communication messages can be lost or arrive with delays.
A Partially Observable Markov Decision Process (POMDP) describes a case wherein an agent operates in an environment where the outcomes of agent actions are stochastic and the state of the process is only partially observable to the agent. A POMDP is a tuple S, A, Ω, P, R, O, where S is the set of process states, A is the set of agent actions and Ω is the set of agent observations. P(s′|a,s) is the probability that the process transitions from state sεS to state s′εS when the agent executes action aεA, while O(ω|a,s′) is the probability that the observation that reaches the agent is ωεΩ. R(s,a) is the immediate reward that the agent receives when it executes action a in state s. Rewards can include a cost of a given action, in addition to any benefit or penalty associated with the action.
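As a concrete illustration of the tuple above, the following sketch encodes a small POMDP in Python. The class layout, the dictionary-based probability tables, and the two-state "Tiger"-style numbers (listening accuracy 0.85, listening cost −1) are assumptions introduced for illustration only, not values taken from the present disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class POMDP:
    """A POMDP as the tuple (S, A, Omega, P, R, O)."""
    states: List[str]                      # S
    actions: List[str]                     # A
    observations: List[str]                # Omega
    P: Dict[Tuple[str, str, str], float]   # P[(s, a, s')] = Pr(s' | a, s)
    R: Dict[Tuple[str, str], float]        # R[(s, a)] = immediate reward
    O: Dict[Tuple[str, str, str], float]   # O[(a, s', w)] = Pr(w | a, s')

# A tiny two-state example in the spirit of the "Tiger Domain" referred to below.
tiger = POMDP(
    states=["TigerLeft", "TigerRight"],
    actions=["Listen", "OpenLeft", "OpenRight"],
    observations=["OTigerLeft", "OTigerRight"],
    P={("TigerLeft", "Listen", "TigerLeft"): 1.0,     # listening does not move the tiger
       ("TigerRight", "Listen", "TigerRight"): 1.0},
    R={("TigerLeft", "Listen"): -1.0,
       ("TigerRight", "Listen"): -1.0},
    O={("Listen", "TigerLeft", "OTigerLeft"): 0.85,
       ("Listen", "TigerLeft", "OTigerRight"): 0.15,
       ("Listen", "TigerRight", "OTigerRight"): 0.85,
       ("Listen", "TigerRight", "OTigerLeft"): 0.15},
)
```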
A POMDP policy π:B×T→A can be defined as a mapping from agent belief states bεB at decision epochs tεT to agent actions aεA. An agent belief state b=(b(s))sεS is the agent's belief about the current state of the system. To solve a POMDP, a policy π* is found that increases (e.g., maximizes) the expected total reward of the agent actions (i.e., the sum of its immediate rewards) over a given time horizon T.
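The belief state b can be maintained with a standard Bayesian update; a minimal sketch, assuming the dictionary-based POMDP representation sketched above, is:

```python
def belief_update(pomdp: POMDP, b: Dict[str, float], a: str, w: str) -> Dict[str, float]:
    """Bayesian belief update: b'(s') is proportional to O(w | a, s') * sum_s P(s' | a, s) * b(s)."""
    new_b = {}
    for s_next in pomdp.states:
        predicted = sum(pomdp.P.get((s, a, s_next), 0.0) * b[s] for s in pomdp.states)
        new_b[s_next] = pomdp.O.get((a, s_next, w), 0.0) * predicted
    norm = sum(new_b.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the given belief and action")
    return {s: p / norm for s, p in new_b.items()}

# Example: starting from a uniform belief, a single (Listen, OTigerLeft) pair
# shifts the belief toward TigerLeft (0.85 vs. 0.15 with the numbers assumed above).
uniform = {"TigerLeft": 0.5, "TigerRight": 0.5}
print(belief_update(tiger, uniform, "Listen", "OTigerLeft"))
```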
According to an exemplary embodiment of the present invention, a D-POMDP model allows for modeling of delayed observations. A D-POMDP is a tuple S, A, Ω, P, R, O, χ, wherein χ is a set of random variables χs,a(k) that specify the probability that an observation is delayed by k decision epochs, when action a is executed in state s. An example of χs,a would be the discrete distribution (0.5, 0.3, 0.2), where 0.5 is the probability of no delay, 0.3 of a one-time-step delay and 0.2 of a two-time-step delay in receiving the observation when action a is executed in state s. D-POMDPs extend POMDPs by modeling the observations that are delayed and by allowing for actions to be executed prior to receiving these delayed observations. In essence, if the agent receives an observation immediately after executing an action, D-POMDPs behave exactly as POMDPs. In a case where an observation does not reach the agent immediately, D-POMDPs behave differently from POMDPs. Rather than having to wait for an observation to arrive, a D-POMDP agent can resume the execution of its policy prior to receiving the observation. A D-POMDP agent can balance the trade-off of acting prematurely (without the information provided by the observations that have not yet arrived) versus executing stop-gap (waiting) actions.
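One straightforward way to represent the delay variables χs,a is as discrete distributions over delay lengths, matching the (0.5, 0.3, 0.2) example above. The class name and sampling helper below are assumptions introduced for illustration.

```python
import random
from typing import Sequence

class DelayDistribution:
    """chi_{s,a}: probs[k] is the probability that the observation is delayed by k decision epochs."""

    def __init__(self, probs: Sequence[float]):
        assert abs(sum(probs) - 1.0) < 1e-9, "delay probabilities must sum to 1"
        self.probs = list(probs)

    def sample(self) -> int:
        """Sample a delay (in decision epochs) for one executed action."""
        return random.choices(range(len(self.probs)), weights=self.probs, k=1)[0]

# The example from the text: no delay with probability 0.5, a one-step delay
# with probability 0.3, and a two-step delay with probability 0.2.
chi_example = DelayDistribution([0.5, 0.3, 0.2])
```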
Quality bounded and efficient solutions for D-POMDPs are described herein. According to an exemplary embodiment of the present invention, a D-POMDP can be solved by converting the D-POMDP to an approximately equivalent POMDP and employing a POMDP solver to solve the obtained POMDP. A parameterized approach can be used for making the conversion from a D-POMDP to its approximately equivalent POMDP. The level of approximation is controlled by an input parameter, D, which represents the number of delay steps considered in a planning process. The extended POMDP obtained from the D-POMDP is defined as the tuple S̄, A, Ω̄, P̄, R̄, Ō, where S̄ is the set of extended states and Ω̄ is the set of extended observations that the agent receives upon executing its actions in extended states. P̄, R̄, Ō are the extended transition, reward and observation functions, respectively. To define these elements of the extended POMDP tuple, the concepts of extended observations, delayed observations, and hypotheses about delayed observations are formalized.
According to an exemplary embodiment of the present invention, an extended observation is a vector ω̄ whose entry ω̄[d]εΩ∪{Ø} is the observation, if any, that reaches the agent in the current decision epoch for the action executed d decision epochs ago, with Ø indicating that no such observation arrived in the current decision epoch.
For example, an agent in a “Tiger Domain” can receive an extended observation ω̄=(OTigerLeft, Ø, OTigerRight) wherein OTigerRight is a consequence of action aListen executed two decision epochs ago.
According to an exemplary embodiment of the present invention, a hypothesis about a delayed observation for an action executed d decision epochs ago is a pair h[d]ε{(ω−,X),(ω+,X),(Ø,Ø)|ωεΩ; Xεχ}. Hypothesis h[d]=(ω−,X) states that the delayed observation for an action executed d decision epochs ago is ωεΩ and that ω is yet to arrive, with a total delay sampled from probability distribution Xεχ. Hypothesis h[d]=(ω+,X) states that the delayed observation for an action executed d decision epochs ago was ωεΩ, that ω has just arrived (in the current decision epoch), and that its delay was sampled from probability distribution Xεχ. Finally, hypothesis h[d]=(Ø,Ø) states that the observation for an action executed d decision epochs ago had already arrived in the past (in a previous decision epoch). In the following, h[d][1] and h[d][2] are used to denote the observation and random variable components of h[d], that is, h[d]≡(h[d][1],h[d][2]).
For example, an agent in a “Tiger Domain” maintains a hypothesis h[2]=(OTigerRight−,χ) whenever it believes that action aListen executed two decision epochs ago resulted in observation OTigerRight that is yet to arrive, with a delay sampled from a distribution χ.
According to an exemplary embodiment of the present invention, an extended hypothesis about the delayed observations for actions executed 1, 2, . . . , D decision epochs ago is a vector h=(h[1], h[2], . . . , h[D]) where h[d] is a hypothesis about a delayed observation for an action executed d decision epochs ago. The set of all possible extended hypotheses is denoted by H.
In each decision epoch, the converted POMDP occupies an extended state s̄=(s,h), where sεS is the current process state and h=(h[1], h[2], . . . , h[D])εH is the current extended hypothesis about the delayed observations.
For example, the converted POMDP for a “Tiger Domain” can occupy an extended state that pairs the tiger's current location with the agent's hypotheses about delayed observations for past aListen actions.
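Under the definitions above, an extended hypothesis is a length-D vector of per-delay hypotheses, and an extended state pairs a process state with such a vector. The encoding below is one possible sketch; the type names are assumptions, and the DelayDistribution class is the one sketched earlier.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothesis about the observation of the action executed d decision epochs ago:
#   ("pending", w, chi)    ~ (w-, X): observation w was generated but has not yet arrived
#   ("arrived", w, chi)    ~ (w+, X): observation w arrived in the current decision epoch
#   ("closed", None, None) ~ (Ø, Ø): the observation already arrived in an earlier epoch
Hypothesis = Tuple[str, Optional[str], Optional[DelayDistribution]]

@dataclass(frozen=True)
class ExtendedState:
    s: str                          # underlying process state, s in S
    h: Tuple[Hypothesis, ...]       # extended hypothesis h = (h[1], ..., h[D])

# Extended observation: entry d holds the observation, if any, that reached the agent
# in the current decision epoch for the action executed d decision epochs ago,
# with None standing in for Ø (nothing arrived for that action this epoch).
ExtendedObservation = Tuple[Optional[str], ...]
```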
To construct the functions P̄, R̄ and Ō that describe the behavior of a converted POMDP, let s̄=(s,h) denote the current extended state and s̄′=(s′,h′) the extended state reached when the agent executes action aεA. The extended transition function P̄ is assembled from the following cases, applied to each delay d:
case 1: Is used when observation ω for action a executed d decision epochs ago has not yet arrived, i.e., if h[d−1][1]=h′[d][1]=ω− and, necessarily, h[d−1][2]=h′[d][2].
case 2: Is used when observation ω for action a executed d decision epochs ago has just arrived, i.e., if h[d−1][1]=ω−, h′[d][1]=ω+ and, necessarily, h[d−1][2]=h′[d][2].
case 3: Is used when observation ω for action a executed d decision epochs ago arrived in the previous decision epoch, i.e., if h[d−1][1]=ω+ and h′[d]=(Ø,Ø).
case 4: Is used when the observation for action a executed d decision epochs ago either arrived before the previous decision epoch or has not arrived and will not arrive (it is discarded after D decision epochs).
In addition, for the special case of d=0, we define:
The probabilities Pb({h′[d][2]=d}|{h′[d][2]≧d}) and Pb({h′[d][2]>d}|{h′[d][2]≧d}) are obtained from the corresponding delay distribution Xεχ.
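For a discrete delay distribution such as the (0.5, 0.3, 0.2) example, these conditional probabilities can be computed directly. The helper below is a sketch assuming the DelayDistribution class introduced earlier.

```python
from typing import Tuple

def delay_tail_probs(chi: DelayDistribution, d: int) -> Tuple[float, float]:
    """Return (Pr[X = d | X >= d], Pr[X > d | X >= d]) for a discrete delay variable X ~ chi."""
    at_least_d = sum(chi.probs[d:])                            # Pr[X >= d]
    if at_least_d == 0.0:
        return 0.0, 0.0                                        # the observation cannot still be pending
    exactly_d = chi.probs[d] if d < len(chi.probs) else 0.0    # Pr[X = d]
    return exactly_d / at_least_d, (at_least_d - exactly_d) / at_least_d

# For chi = (0.5, 0.3, 0.2) and d = 1:
#   Pr[X = 1 | X >= 1] = 0.3 / 0.5 = 0.6 and Pr[X > 1 | X >= 1] = 0.2 / 0.5 = 0.4.
arrive_now, stay_delayed = delay_tail_probs(chi_example, 1)
```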
When the converted POMDP transitions to s̄′=(s′,h′)=(s′,(h′[1], h′[2], . . . , h′[D])) as a result of the execution of a, the agent receives an extended observation ω̄. The probability Ō(ω̄|a,s̄′) of this extended observation is determined by the following cases, applied to each delay d:
case 1: Is used when the agent had been waiting for a delayed observation ω for an action that it had executed d decision epochs ago, but this delayed observation did not arrive in the extended observation ω̄.
case 2: Is used when the agent had been waiting for a delayed observation ω for an action that it had executed d decision epochs ago and this delayed observation did arrive in the extended observation ω̄.
case 3: Is used when the agent had not been waiting for a delayed observation for an action that it had executed d decision epochs ago, and accordingly no such observation arrives in the extended observation ω̄.
The extended POMDP thus obtained can be solved using any existing POMDP solver.
According to an exemplary embodiment of the present invention, an online policy modification handles delayed observations that arrive at run time for actions executed more than D decision epochs ago, using them to refine the agent's estimate of its current extended belief state.
Once the estimation of the current extended belief state is refined by these delayed observations from more than D decision epochs ago, the action corresponding to the new belief state is determined from the value vectors. The original set of value vectors (i.e., the policy) is still applicable, because the belief state is a sufficient statistic and the policy is defined over the entire belief space.
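The action-extraction step can be sketched as a maximization over value vectors (alpha vectors) applied to the refined belief; representing the policy as a list of (alpha vector, action) pairs is an assumption made here for illustration.

```python
from typing import Dict, List, Tuple

def extract_action(belief: Dict[str, float],
                   value_vectors: List[Tuple[Dict[str, float], str]]) -> str:
    """Pick the action of the value vector maximizing the dot product with the belief.

    `value_vectors` is a list of (alpha vector, action) pairs, where each alpha vector
    maps process states to values.  Because the belief state is a sufficient statistic
    and the policy is defined over the entire belief space, the same vectors remain
    applicable after the belief has been refined by late-arriving observations.
    """
    scores = [(sum(alpha.get(s, 0.0) * p for s, p in belief.items()), action)
              for alpha, action in value_vectors]
    best_value, best_action = max(scores)
    return best_action
```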
Referring to the accompanying drawings, a method 100 according to an exemplary embodiment of the present invention is described in more detail below.
To solve a decision problem involving delayed observations exactly, one must use an optimal POMDP solver, and the conversion from D-POMDP to POMDP must be done with D≧sup{d|Pb[X=d]>0, Xεχ} to prevent the delayed observations from ever being discarded. However, to trade off optimality for speed, one can use a smaller D, resulting in a possible degradation in solution quality. The error in the expected value of the POMDP (obtained from the D-POMDP) policy when such a D is chosen, that is, when D is less than a maximum delay Δ of the delayed observations, is bounded as described below.
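Before turning to the bound, note that the threshold sup{d|Pb[X=d]>0, Xεχ} can be read directly off discrete delay distributions; a sketch, assuming the DelayDistribution class introduced earlier:

```python
from typing import List

def max_delay(delay_distributions: List[DelayDistribution]) -> int:
    """Largest delay d that occurs with positive probability under any chi_{s,a}.

    Planning with D >= max_delay(...) ensures that no delayed observation is ever
    discarded; choosing a smaller D trades optimality for speed.
    """
    return max(max(d for d, p in enumerate(chi.probs) if p > 0.0)
               for chi in delay_distributions)

# For chi = (0.5, 0.3, 0.2), max_delay([chi_example]) == 2, so D = 2 discards nothing,
# whereas D = 1 may discard observations delayed by two decision epochs.
```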
Consider a D-POMDP converted to a POMDP for a given D. For any s, s′εS, aεA and hεH it then holds that:
This proposition (i.e., Eq. (1)) bounds the error made by P̄ in estimating the true transition probability of the underlying Markov process. This is then used to determine the error bound on value as follows:
Using Eq. (1), the error in the expected value of the POMDP (obtained from the D-POMDP) policy for a given D is then bounded by:
According to an embodiment of the present invention, improvement in solution quality is achieved through online policy modification. One objective of online policy modification is to keep the belief distribution up to date based on the observations, irrespective of when they are received. In certain specific situations it is possible to guarantee a definite improvement in value.
Improvement in solution quality can be demonstrated in cases where: (a) a belief state corresponding to a delayed observation has more entropy than a belief state corresponding to any normal observation; and (b) for some characteristics of the value function, value decreases when the entropy of the belief state increases. Consider the following:
Corresponding to a belief state b and action a, denote by bω the belief state on executing action a and observing ω and by bφ the belief state on executing action a and an observation getting delayed (represented as observation φ). In this context, if
O(s̃,a,φ)=Oφ, ∀s̃εS
for some constant Oφ, then condition (a) above holds, i.e., the belief state bφ has at least as much entropy as any belief state bω.
For any two belief points b1 and b2 in the belief space, if
then the online policy modification improves on the value provided by the offline policy.
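Condition (a) compares the entropy of belief states reached via the delayed (pseudo-)observation φ and via concrete observations ω. A small sketch of that comparison, using Shannon entropy over a belief dictionary, is shown below; the function names are illustrative.

```python
import math
from typing import Dict, List

def belief_entropy(b: Dict[str, float]) -> float:
    """Shannon entropy (in bits) of a belief state over process states."""
    return -sum(p * math.log2(p) for p in b.values() if p > 0.0)

def delayed_belief_has_more_entropy(b_phi: Dict[str, float],
                                    beliefs_omega: List[Dict[str, float]]) -> bool:
    """Condition (a): the belief reached via the delayed pseudo-observation phi is at
    least as uncertain as the belief reached via any concrete observation omega."""
    return all(belief_entropy(b_phi) >= belief_entropy(b_w) for b_w in beliefs_omega)
```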
To graphically illustrate the improvement demonstrated in connection with Eq. (3), reference is made to the accompanying drawings.
Referring to the complexity of online policy modification, for a given D, the number of extended observations is |Ō|=|Ω∪{Ø}|^D and the number of extended states is |S̄|=|S|·|H|, where H is the set of extended hypotheses.
Given a D-POMDP, wherein the maximum delay for any observation is Δ:=sup{d|Pb[X=d]>0, Xεχ} and the time horizon is T, a maximum number of belief updates, Nb, is given by:
It should be understood that the use of term maximum herein denotes a value, and that the value can vary depending on a method used for determining the same. As such, the term maximum may not refer to an absolute maximum and can instead refer to a value determined using a described method.
As can be seen in lines 9 and 10 of the online policy modification procedure, each delayed observation that arrives triggers an additional belief update. The extra updates can be counted as follows:
Updates at time step 1: In an extreme case, the observation generated at time step 1 is received only at time step Δ. That observation thus introduces just one extra belief update at time step 1.
Updates at time step 2: There are at most two extra belief updates introduced at time step 2: one from an observation generated at time step 1 but received at time step Δ and another from an observation generated at time step 2 but received at time step Δ+1.
Updates at time step t≦Δ: There are at most t extra belief updates introduced at time step t: one from each observation generated at time step t′ but received at time step Δ+t′, for 1≦t′≦t.
Updates at time step Δ&lt;t≦T: There are at most Δ extra belief updates introduced at time step t: one from each observation generated at time step t′ but received at time step min{Δ+t′,T}, for t−Δ&lt;t′≦t.
Adding the maximum numbers of extra belief updates introduced at time steps 1 through T, a maximum total number of belief updates Nb is obtained.
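Summing the per-time-step counts above gives one way to tally the maximum number of belief updates. The closed form in the sketch below is derived from that counting (T regular updates plus the extra updates enumerated above) and is an illustration rather than a formula quoted from the disclosure.

```python
def max_belief_updates(T: int, delta: int) -> int:
    """Upper bound on the total number of belief updates over horizon T.

    Counts T regular updates plus the extra updates enumerated above: at most t of
    them at each time step t <= delta and at most delta at each later time step.
    """
    delta = min(delta, T)
    extra = delta * (delta + 1) // 2 + (T - delta) * delta
    return T + extra

# Example: with horizon T = 10 and maximum delay delta = 2,
# max_belief_updates(10, 2) == 10 + 3 + 16 == 29.
```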
It should be understood that the methodologies of embodiments of the invention may be particularly well-suited for planning in uncertain conditions.
By way of recapitulation, according to an exemplary embodiment of the present invention, a decision engine (e.g., embodied as a computer system) performs a method 400 for adjusting a policy corresponding to delayed observations at runtime.
The process of solving the policy (403) further includes receiving delayed observations (404), updating agent beliefs using the delayed observations, historical agent actions and historical observations (405), extracting an action using the updated agent beliefs (406) and executing the extracted action (407). At block 407, the agent can be instructed to execute the extracted action.
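Blocks 404-407 amount to a simple run-time loop. The sketch below is a minimal illustration: the function and parameter names are assumptions, the helpers are those sketched earlier, and, for simplicity, each delayed observation is folded into the current belief with the standard update rather than by re-estimating the full extended belief state.

```python
from typing import Callable, Dict, List, Tuple

def run_policy(pomdp: POMDP,
               initial_belief: Dict[str, float],
               value_vectors: List[Tuple[Dict[str, float], str]],
               horizon: int,
               receive_delayed_observations: Callable[[int], List[Tuple[str, str]]],
               execute: Callable[[str], None]) -> None:
    """Run-time loop: receive delayed observations, update beliefs, extract and execute actions."""
    belief = dict(initial_belief)
    for t in range(horizon):
        # Block 404: collect whatever observations (possibly delayed) have arrived by now,
        # as (historical action, observation) pairs.
        for past_action, obs in receive_delayed_observations(t):
            # Block 405: update agent beliefs using the delayed observation together with
            # the historical action it corresponds to.
            belief = belief_update(pomdp, belief, past_action, obs)
        # Block 406: extract an action from the updated beliefs via the value vectors.
        action = extract_action(belief, value_vectors)
        # Block 407: instruct the agent to execute the extracted action.
        execute(action)
```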
The methodologies of embodiments of the disclosure may be particularly well-suited for use in an electronic device or alternative system. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “processor,” “circuit,” “module” or “system.”
Furthermore, it should be noted that any of the methods described herein can include an additional step of providing a system for adjusting a policy corresponding to delayed observations. According to an embodiment of the present invention, the system is a computer executing a policy and monitoring agent actions. Further, a computer program product can include a tangible computer-readable recordable storage medium with code adapted to be executed to carry out one or more method steps described herein, including the provision of the system with the distinct software modules.
Referring to the accompanying drawings, an exemplary computer system for implementing methods of the present invention includes, inter alia, a processor 501, a memory 502, and media 507.
In different applications, some of the components shown in the accompanying drawings can be omitted.
The processor 501 may be configured to perform one or more methodologies described in the present disclosure, illustrative embodiments of which are shown in the above figures and described herein. Embodiments of the present invention can be implemented as a routine that is stored in memory 502 and executed by the processor 501 to process the signal from the media 507. As such, the computer system is a general-purpose computer system that becomes a specific purpose computer system when executing routines of the present disclosure.
Although the computer system described herein can support methods according to the present disclosure, this system is only one example of a computer system; other computer system architectures can also be used to implement embodiments of the present invention.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made therein by one skilled in the art without departing from the scope of the appended claims.
This application claims the benefit of U.S. Provisional Patent Application No. 62/036,417 filed on Aug. 12, 2014, the complete disclosure of which is expressly incorporated herein by reference in its entirety for all purposes.
This invention was made with Government support under Contract No.: W911NF-06-3-0001 awarded by the Army Research Office (ARO). The Government has certain rights in this invention.