The present invention relates generally to financial planning and investing, and particularly, to a system and method for devising investment strategies and determining an optimal investment strategy in accordance with an expected risk sensitivity at a particular point in time.
Recent years have seen an unprecedented rise of interest in decision support systems that help investors to choose an investment strategy to maximize their returns. In particular, Partially Observable Markov Decision Processes (POMDPs) (see, e.g., E. J. Sondik, entitled The Optimal Control of Partially Observable Markov Processes, Ph.D. Thesis, Stanford University, 1971) have received a lot of attention due to their ability to provide multistage strategies that address the uncertainty of the investment outcomes and the uncertainty of market conditions head-on.
Yet, POMDP solvers (see, e.g., M. Hauskrecht, entitled Value-function approximations for POMDPs, JAIR, 13:33-94, 2000; Z. Feng and S. Zilberstein, entitled Region-based incremental pruning for POMDPs, in UAI, pages 146-153, 2004; and, J. Pineau, G. Gordon, and S. Thrun, entitled PBVI: An anytime algorithm for POMDPs, IJCAI, pages 335-344, 2003) typically maximize the expected value of the investments. In contrast, in high-stake domains such as financial planning, it is often imperative to find an optimal investment strategy that maximizes the expected “utility” of the investments, for non-linear utility functions that characterize the investor attitude towards risk. While it has been demonstrated how to solve multistage stochastic optimization problems where risk-sensitivity is expressed via utility functions, this was only for problems characterized by fully observable market conditions.
It would be highly desirable to provide a system and method that enables the generation of a theoretic model for risk-sensitive financial planning under partially observable market conditions and the solution of such model that accounts for risk sensitivity.
Currently, there are no algorithms known in the art that can provide an optimal POMDP solution that accounts for risk sensitivity.
The present invention addresses the above-mentioned shortcomings of the prior art approaches by first defining Risk-Sensitive POMDPs, and generating a novel decision theoretic model for risk-sensitive financial planning under partially observable market conditions.
In one aspect, by considering piecewise linear approximations of utility functions, the method implements a functional value iteration method using a “solver” to solve Risk-Sensitive POMDPs optimally by computing the underlying value functions exactly, through the exploitation of their piecewise bilinear properties. In one aspect, the value functions are derived analytically using a Functional Value Iteration algorithm.
Further to this aspect, to speed up the implemented Risk-Sensitive POMDPs solver, the system and method finds and prunes the dominated investment strategies using efficient linear programming approximations to the underlying non-convex bilinear programs. That is, by deriving the fundamental properties of the underlying value functions, the method provides a functional value iteration technique to compute them exactly, and further, provides an efficient procedure to determine the dominated value functions, to speed up the algorithm.
In one aspect, there is provided a system, method and computer program product for determining an investment strategy for a risk-sensitive user. The method comprises: modeling a user's attitude towards risk as one or more utility functions, the utility function transforming a wealth of the user into a utility value; generating a risk-sensitive Partially Observable-Markov Decision Process (PO-MDP) based on the one or more utility functions; and, implementing Functional Value Iteration for solving the risk-sensitive PO-MDP, the solution determining an action or policy calculated to maximize an expected total utility of an agent's actions at a particular point in time acting in a partially observable environment.
Further to this aspect, the generating of the risk-sensitive PO-MDP comprises: generating an expected utility function VUn(b,w) for 0≦n≦N,b∈B,w∈Wn where Wn denotes the set of all possible user wealth levels in decision epoch n; and, maximizing the expected utility function VUn(b,w) for a user when commencing action a∈A, where A is a set of Actions, in decision period n in a belief state b with a wealth level w.
In a further aspect, there is provided a system for determining an investment strategy for a risk-sensitive user comprising: a memory; a processor in communications with the memory, wherein the system performs a method comprising: modeling a user's attitude towards risk as one or more utility functions, the utility function transforming a wealth of the user into a utility value; generating a risk-sensitive Partially Observable-Markov Decision Process (PO-MDP) based on the one or more utility functions; and, implementing Functional Value Iteration for solving the risk-sensitive PO-MDP, the solution determining an action or policy calculated to maximize an expected total utility of an agent's actions at a particular point in time acting in a partially observable environment.
A computer program product is provided for performing operations. The computer program product includes a storage medium readable by a processing circuit and storing instructions run by the processing circuit for running a method. The method is the same as listed above.
The objects, features and advantages of the present invention will become apparent to one skilled in the art, in view of the following detailed description taken in combination with the attached drawings, in which:
In one aspect, there is provided a system, method and computer program product that devises and solves, for a risk-sensitive investor, an optimal investment strategy. In one embodiment, the system and method allows for multistage investment strategies. The system and method operates to estimate the market state from noisy observations, and handles partially observable market states. Thus, in one aspect, to estimate the market state from noisy observations, the method of the invention models the data as a Partially Observable Markov Decision Process (PO-MDP).
In one embodiment, there are two ways to make decisions in such settings: (a) Expected value maximization 20, which is the risk-neutral way to make decisions, i.e., it does not consider that people have various attitudes towards risk. Thus, this method always makes the same decision regardless of the investor's risk attitude. As shown in
As utility theory defines utility functions as transforming the current wealth of an agent (its initial wealth plus the sum of the immediate rewards it received so far) into a utility value, the shape of the utility function can be used to define the agent's attitude towards risk. To compute optimal policies for such risk-sensitive agents, acting in partially observable environments, the finite horizon POMDPs may be solved that maximize the expected total utility of agent actions. On account of being sensitive to risk attitudes, these planning problems are referred to as Risk-Sensitive POMDPs characterized as comprising the following: S is a finite set of discrete states of the process; A is a finite set of agent actions. The process starts in some state s0∈S and runs for N consecutive decision epochs. In particular, if the process is in state s∈S in decision epoch 0≦n≦N, the agent controlling it chooses an action a∈A to be executed next. The agent then receives the immediate reward R(s,a) while the process transitions with probability P(s′|s,a) to state s′∈S at decision epoch n+1. Otherwise, in decision epoch n=N, the process terminates.
The utility of the actions that the agent has executed is then a scalar
U(w0+Σn=0N−1rn)
where w0 is the initial wealth of the agent, U is the agent utility function and rn is the immediate reward that the agent received in decision epoch n. The goal of the agent is to devise a policy π that maximizes its total expected utility:
E[U(w0+Σn=0N−1rn)|π].
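The role of the utility function in this objective can be illustrated with a short sketch. The square-root utility and the lottery numbers below are illustrative assumptions, not taken from the text; the point is only that a concave U changes which choice maximizes the objective even when expected values are equal:

```python
import math

def expected_value(outcomes):
    # outcomes: list of (probability, total_reward) pairs of a one-shot lottery
    return sum(p * r for p, r in outcomes)

def expected_utility(outcomes, U, w0=0.0):
    # E[U(w0 + sum of rewards)] for a one-shot lottery
    return sum(p * U(w0 + r) for p, r in outcomes)

# A risk-averse (concave) utility prefers the sure $50 over a fair coin
# flip for $0 or $100, even though the expected values are identical.
U = lambda w: math.sqrt(w)
sure = [(1.0, 50.0)]
gamble = [(0.5, 0.0), (0.5, 100.0)]
```

A risk-neutral agent (U(w)=w) is indifferent between the two lotteries; the concave U above strictly prefers the sure outcome.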
What further complicates the agent's search for policy “π” is that the process is only partially observable to the agent. That is, the agent receives noisy information about the current state s∈S of the process and can therefore only maintain the current probability distribution b(s) over states s∈S (referred to as the agent belief state). When the agent executes some action a∈A and the process transitions to state s′, the agent receives with probability O(z|a, s′) an observation z from a finite set of observations Z. The agent then uses z to update its current belief state b, as will be described in greater detail herein below. In the following, B denotes an infinite set of all possible agent belief states and b0∈B is the agent's starting belief state (e.g., unknown at the planning phase).
Additionally, W:=∪0≦n≦NWn is the set of all possible agent wealth levels where Wn denotes the set of all possible agent wealth levels in decision epoch n. For the initial range of agent wealth levels W0:=[w0,
In the method 100 for providing or devising an optimal single or multi-stage investment strategy, at 102, an entity, e.g., a user, a business organization, a business target, or an agent, constructs one or more utility functions. These utility functions are of a shape that can represent the user's, e.g., agent's, attitude towards risk, and the PO-MDP solver framework is used to maximize the expected total utility (as opposed to expected total reward) of agent actions. For purposes of illustration,
In the set-up of the PO-MDP, the elicited utility function(s) U(w), which express the investor's attitude towards risk by mapping all attainable wealth levels w to their utility as perceived by a user, e.g., an investor or an agent, are input to a computer or like processing device such as described with respect to
As shown in
Then, this Risk-Sensitive POMDP is solved. That is, there is determined which action (policy) a∈A the investor should execute in decision epoch n∈[0, 1, . . . , N], with wealth level w∈[wmin, wmax], if the investor believes that the probability that the market is in state s is b(s), for all s∈S. As shown in
The processing at 110,
where P(z|b,a)=Σs′∈SO(z|a, s′)Σs∈SP(s′|s,a)b(s) is the probability of observing z after executing action a from belief state b, R(b,a):=Σs∈Sb(s)R(s,a) is the expected immediate reward that the agent will receive for executing action a in belief state b and T(b,a,z) is the new belief state of the agent after executing action a from belief state b and observing z. Formally, for each s′∈S it holds that:
T(b,a,z)(s′)=[O(z|a,s′)/P(z|b,a)]Σs∈SP(s′|s,a)b(s).
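The belief update above can be sketched directly in Python. The dictionary layouts P[s][a][s′] and O[z][a][s′] and the two-state market example are assumptions for illustration:

```python
def belief_update(b, a, z, P, O, states):
    """Compute T(b,a,z) per the formulas above (a sketch; P[s][a][s2]
    and O[z][a][s2] are assumed dictionary layouts)."""
    # P(z|b,a) = sum_{s'} O(z|a,s') * sum_s P(s'|s,a) * b(s)
    pz = sum(O[z][a][s2] * sum(P[s][a][s2] * b[s] for s in states)
             for s2 in states)
    # T(b,a,z)(s') = [O(z|a,s') / P(z|b,a)] * sum_s P(s'|s,a) * b(s)
    return {s2: O[z][a][s2] * sum(P[s][a][s2] * b[s] for s in states) / pz
            for s2 in states}

# Hypothetical two-state market ("sb" = bad, "sg" = good), a "hold" action
# that leaves the state unchanged, and an "up" observation more likely
# when the market is good.
states = ["sb", "sg"]
P = {"sb": {"hold": {"sb": 1.0, "sg": 0.0}},
     "sg": {"hold": {"sb": 0.0, "sg": 1.0}}}
O = {"up": {"hold": {"sb": 0.4, "sg": 0.8}}}
nb = belief_update({"sb": 0.5, "sg": 0.5}, "hold", "up", P, O, states)
```

Starting from the uniform belief, observing "up" shifts probability mass toward the good state, as the Bayesian update requires.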
Hence, to find the optimal policy, π*, value iteration is employed to calculate values VUn(b,w) for all 0≦n≦N, b∈B,w∈Wn. Value iteration calculates these values for n=N,N−1, . . . ,0. Specifically, as follows from step 150,
VUN(b,w)=U(w)  2)
for all w∈WN, b∈B. Otherwise, for all 0≦n<N,
for all b∈B and w∈Wn. In the following, values of VUn(b,w) are grouped over all (b,w)∈B×W into value functions VUn:B×W→ℝ, for each 0≦n≦N. Note that computing value functions VUn from value functions VUn+1 exactly is difficult because B and W are infinite. In addition, POMDP solution techniques that already handle an infinite B are not applicable for solving Risk-Sensitive POMDPs because they do not handle an infinite W.
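For intuition, the recursion of Equations (2)-(3) can be evaluated by brute force on a tiny instance. This is only an illustrative sketch, not the invention's exact solver, which represents each value function with piecewise bilinear functions instead of enumerating reachable beliefs; folding the expected immediate reward R(b,a) into the wealth argument is likewise an assumption of the sketch:

```python
def solve(N, S, A, Z, P, O, R, U, b0, w0):
    """Brute-force backward recursion for V^n_U(b,w) on a tiny instance.
    P[s][a][s2], O[z][a][s2], R[s][a] are assumed dictionary layouts."""
    def Rb(b, a):  # expected immediate reward R(b,a)
        return sum(b[s] * R[s][a] for s in S)
    def Pz(b, a, z):  # P(z|b,a)
        return sum(O[z][a][s2] * sum(P[s][a][s2] * b[s] for s in S)
                   for s2 in S)
    def T(b, a, z):  # belief update T(b,a,z)
        pz = Pz(b, a, z)
        return {s2: O[z][a][s2] * sum(P[s][a][s2] * b[s] for s in S) / pz
                for s2 in S}
    def V(n, b, w):
        if n == N:              # Equation (2): V^N_U(b,w) = U(w)
            return U(w)
        return max(             # maximize expected utility-to-go over actions
            sum(Pz(b, a, z) * V(n + 1, T(b, a, z), w + Rb(b, a))
                for z in Z if Pz(b, a, z) > 0)
            for a in A)
    return V(0, dict(b0), w0)
```

With a single fully observed state, two epochs, and a linear utility, the recursion simply accumulates the best reward twice.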
The functional value iteration technique for solving Risk-Sensitive POMDPs exactly is now described according to one embodiment. This technique backs up utility functions (unlike just reward values in value iteration) defined on the wealth over the entire time horizon. The method iteratively constructs the finite partitioning of the B×W search space into regions where the value functions can be represented with point based policies, a point based policy being a mapping from the observations received so far to an action that should be executed next. For example, as shown in
In one embodiment, if there are only two states, then a belief state b belongs to a set [0,1]=B; a wealth interval, on the other hand, is [Wmin, Wmax]=W. Thus, the “whole” region B×W can be partitioned in multiple ways, e.g., into four sub-regions:
[0,0.5]×[Wmin, (Wmin+Wmax)/2]
[0,0.5]×[(Wmin+Wmax)/2, Wmax]
[0.5,1]×[Wmin, (Wmin+Wmax)/2]
[0.5,1]×[(Wmin+Wmax)/2, Wmax]
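The midpoint partition listed above can be written out directly; this trivial sketch is for the two-state case only, and the function name is illustrative:

```python
def quarter_regions(w_min, w_max):
    """Split B x W = [0,1] x [w_min, w_max] into the four sub-regions
    listed above (two states, so the belief simplex B is [0,1])."""
    w_mid = (w_min + w_max) / 2.0
    return [((0.0, 0.5), (w_min, w_mid)),
            ((0.0, 0.5), (w_mid, w_max)),
            ((0.5, 1.0), (w_min, w_mid)),
            ((0.5, 1.0), (w_mid, w_max))]
```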
To this end, Zn is denoted as a set of agent observation histories of length less than “n”. Also, for each decision epoch 0≦n≦N, there is defined a point based policy {dot over (π)}n as a function
{dot over (π)}n:ZN−n→A 4)
and the expected utility-to-go of {dot over (π)}n at some belief state and wealth level pair (b,w)∈B×Wn as a value (i.e., a function over B×Wn) set forth according to equation 5) as follows:
Letting {{dot over (π)}in}i∈I(n) be a collection of point-based policies so defined for a decision epoch n, any policy π can be represented as some (possibly infinite) collection of point-based policies. For example, to represent π in decision epoch n, a different point-based policy {dot over (π)}in may be maintained for each (b,w)∈B×Wn. In particular, to represent π* in decision epoch n, there may be maintained a different point-based policy argmax{dot over (π)}
In one aspect of the invention, finite collections {{dot over (π)}in}i∈I(n) for 0≦n≦N that represent π* are computed. The technique of the invention assumes that the utility function U(w) is piecewise linear over w∈WN (or, that it has already been approximated with a piecewise linear function with a desired accuracy). Specifically, given that there exist wealth levels wN=w1<. . . <wK=
According to the invention, for such U, as is proven by induction analysis, the following holds for all 0≦n≦N:
1. The value function VUn is represented by a finite set of functions {υ{dot over (π)}in}i∈I(n). That is, there exists a partitioning {Yin}i∈I(n) of B×Wn and a set of point-based policies {{dot over (π)}in}i∈I(n) such that for all (b,w)∈B×Wn there exists i∈I(n) such that (b,w)∈Yin and VUn(b,w)=υ{dot over (π)}in(b,w)=maxi′∈I(n)υ{dot over (π)}i′n(b,w).
2. For all i∈I(n), υ{dot over (π)}in is piecewise bilinear. That is, there exists a finite partitioning {B×Wi,kn}k∈I(n,i) of B×Wn such that Wi,kn is a convex set and for all (b,w)∈B×Wi,kn, υ{dot over (π)}in(b,w)=Σs∈Sb(s)(ci,k,snw+di,k,sn), for all k∈I(n,i);
3. For all i∈I(n), υ{dot over (π)}in can be derived from the set of functions {υ{dot over (π)}i′n+1}i′∈I(n+1).
As part of the induction analysis, it is assumed that the claims hold for n+1 and shown that they also hold for n. To this end, from Equation (3), VUn(b,w) is calculated by:
which calculation is broken into five stages:
First, as shown in the Appendix, there is calculated, in a first stage,
Thus, as shown in
Referring to
The operation to construct the set of bilinear functions γn is performed by a Linear/Integer program “solver”, such as ILOG CPLEX™ (available from International Business Machines, Inc.), embodied by a programmed computing system (e.g., a computing system 400 as shown in
N=The number of decision epochs;
U=The agent utility function(s) that maps the agent wealth w to its utility; U(w) is a piecewise linear approximation of an arbitrary utility function elicited from a user, e.g., an investor and is specified by constants Ck, and Dk, k=1, . . . , K, as explained in greater detail herein below.
As shown in
An example data structure to represent these solver inputs is therefore a tuple (N,U,S,A,P,Z,O,R) where N is an integer, U is a piecewise linear function on domain (min_wealth, max_wealth), and S, A, Z are binary vectors that give unique identifiers to states, actions and observations respectively. P:S×A×S→[0,1] is a state-to-state transition function, O:S×A×Z→[0,1] is an observation function and R:S×A→[reward_min, reward_max] is a reward function.
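One possible concrete rendering of this tuple is sketched below; the field names and dictionary layouts are illustrative assumptions, not the patent's required representation:

```python
from typing import NamedTuple, Callable, Dict, List

class RiskSensitivePOMDP(NamedTuple):
    """A hypothetical layout for the (N, U, S, A, P, Z, O, R) tuple
    described above."""
    N: int                       # number of decision epochs
    U: Callable[[float], float]  # piecewise linear utility function
    S: List[str]                 # state identifiers
    A: List[str]                 # action identifiers
    P: Dict                      # P[s][a][s'] transition probabilities
    Z: List[str]                 # observation identifiers
    O: Dict                      # O[z][a][s'] observation probabilities
    R: Dict                      # R[s][a] immediate rewards
```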
The equations for processing these inputs by the solver are programmed into the solver and are computed according to the proof by induction provided in the Appendix. Additionally, the solver proceeds by computing the value functions Vn(b,w) starting from n=N, then n=N−1, . . . , and finally n=0. As soon as V0(b,w) is found, the agent knows what action to execute in the starting decision epoch.
In solving the equations below, the following are defined:
n is the current epoch;
w is the wealth level;
s denotes some state;
b is a probability distribution over states, i.e., the agent current belief state;
b(s) is the agent's belief that the system is in state s, for all states from the set of states S. As an example, two states, sb and sg, are considered such that sb=market is bad, and sg=market is good. Then b=(0.2, 0.8) means that the agent believes that the current system state is sb with probability b(sb)=0.2, and that the current system state is sg with probability b(sg)=0.8;
b is a belief variable;
w is a wealth variable;
(b,w) is a feasible solution to the
(b′ , x) is a feasible solution corresponding to (b,w) where (b′:=b,x:=bw);
x=[x(s)],s∈S is a vector.
Program (17) relaxes Program (16b) because for any feasible solution (b,w) there exists a corresponding feasible solution (b′:=b,x:=bw);
c and d (or the variations thereof, with various indices) are constants.
V(b,w) is the value function returned by the solver that is represented using sets of bilinear functions.
The method includes implementing calculations performed by the solver. When the algorithm starts, the known constants are the constants Ck and Dk, k=1, 2, . . . , K, that specify the piecewise linear utility function U (defined in each of the K wealth intervals as a linear function Ckw+Dk). In the description of the method, auxiliary constants c and d are introduced (as set forth in the staged operations 1, 2, 3, 4, 5 in the Appendix).
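A piecewise linear U specified by constants Ck and Dk might be evaluated as follows; the breakpoint-list layout and the concave two-piece example are assumptions for illustration:

```python
from bisect import bisect_right

def piecewise_linear_U(w, breakpoints, C, D):
    """Evaluate U(w) = C[k]*w + D[k] on the k-th wealth interval.
    breakpoints w_1 < ... < w_{K+1} delimit the K intervals (an
    assumed input layout); C[k], D[k] are the constants above."""
    # locate the interval containing w; clamp at the ends
    k = min(max(bisect_right(breakpoints, w) - 1, 0), len(C) - 1)
    return C[k] * w + D[k]

# A concave (risk-averse) two-piece utility, continuous at w = 50:
# U(w) = 2w on [0, 50] and U(w) = w + 50 on [50, 100].
bp, C, D = [0.0, 50.0, 100.0], [2.0, 1.0], [0.0, 50.0]
```

For a continuous U, adjacent pieces must agree at each interior breakpoint, as the two pieces above do at w=50.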
The method includes:
for constants ca,z,in,k,s:=Σs′∈SP(s′|s,a)O(z|a,s′)ci,k,s′n+1 and da,z,in,k,s:=Σs′∈SP(s′|s,a)O(z|a,s′)di,k,s′n+1, where these constants ca,z,in,k,s and da,z,in,k,s are obtained by the computer system from utility functions, observed data and belief states. For example, constants cn+1i,k,s and dn+1i,k,s are obtained by the computer system from utility functions (when n=N) or from the previous algorithm iteration (when n<N). This calculation exhibits that function υa,z,in(b,w) from a stage 1 calculation is piecewise bilinear over (b,w)∈B×Wn+1.
for all (b,w)∈B×Win+1, k∈I(n+1,i) where
for all (b,w)∈B×Wn+1, where constants ca,in,k,s:=Σz∈Z
4. Calculating, by the solver after a stage 3 calculation, the following equation (24) from Lemma 3, Appendix:
for all (b,w)∈B×Wn, where
5. Then, calculating, by the solver, the following equation (25) from Lemma 3, Appendix:
6. Finally, there is calculated by the solver, a calculation of the following equation (15) from Stage 5, Appendix:
Therefore, VUn(b,w) is represented by a finite set of piecewise bilinear functions Vn={υ{dot over (π)}(a,i)n}(a,i)∈I(n)={
Thus, in the method implemented by the solver, the output produced at each of the equations below is a new (temporary) set of bilinear functions, represented using the corresponding new (temporary) constants c and d (with different indices). At the last step, the solver returns the value function V(b,w) at an epoch n that is represented using sets of bilinear functions Vn={υ{dot over (π)}(a,i)n}(a,i)∈I(n)={
Thus, when the algorithm terminates, each bilinear function “fi” from set Vn={υ{dot over (π)}(a,i)n}(a,i)∈I(n)={
fi=Σs b(s)(cni,k,s·w+dni,k,s)
is bilinear.
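Evaluating such a bilinear function is straightforward; list-indexed states are an assumed layout in this sketch:

```python
def bilinear_value(b, w, c, d):
    """Evaluate f(b,w) = sum_s b(s)*(c[s]*w + d[s]), the bilinear
    form above: linear in b for fixed w, and linear in w for fixed b."""
    return sum(b_s * (c_s * w + d_s) for b_s, c_s, d_s in zip(b, c, d))
```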
That is, in view of
In a further embodiment, in order to speed up the implemented Risk-Sensitive POMDP solver, the system and method includes finding and pruning the dominated investment strategies using efficient linear programming approximations to underlying non-convex bilinear programs. Thus, referring to
In one exemplary embodiment, as mentioned in the stages 1, 3, 5 of the induction proof incorporated herein and described in the Appendix, the solver implements functionality for speeding up the algorithm by pruning, from a set of piecewise bilinear functions, those functions that are jointly dominated by other functions. The solver implementation quickly and accurately identifies whether a function is dominated or not. Formally, for a set of piecewise bilinear functions V={υi:B×W→R}i∈I there is determined if some υj∈V is dominated, i.e., if for all (b,w)∈B×W there exists υi∈V, i≠j such that υi(b,w)>υj(b,w).
Letting υi∈V be piecewise bilinear over B×W, i.e., there is a partitioning {B×Wi,k}1≦k≦K(i) of B×W such that set Wi,k is convex and υi(b,w)=Σs∈Sb(s)(ci,ksw+di,ks) for all (b,w)∈B×Wi,k, 1≦k≦K(i). Thus, there exist wealth levels w=wi,0<. . . <wi,k<. . . <wi,K(i)=
υj∈V is then not dominated if there exists 1≦k≦K and (b,w)∈B×[wk−1,wk] such that for all υi∈V, i≠j it holds that υi,k(b,w)<υj,k(b,w). That is, if for some 1≦k≦K there exists a feasible solution (b,w) to Program
also written as
where the program “max ∅” represents the attempt to maximize an empty/blank objective function “∅”, i.e., only the feasibility of the constraints is checked; variable b=[b(s)]s∈S is a vector; ci,j,ks:=
In one embodiment, due to the presence of non-linear, non-convex constraints in solving Program (16b), i.e., because of the terms Σs∈Sb(s)(ci,j,ksw+di,j,ks)>0, υi∈V, a solution is to relax the constraints.
However, by relaxing the constraints of Program (16b), the chance of finding a feasible solution (b,w) is increased, thus decreasing the chance of pruning υj from V. Therefore such a relaxation may result in keeping in V some of the dominated functions, which may slow down the algorithm.
As some of the constraints in these Programs (16, 17, 18) involve a multiplication of variables b and w, there is a quadratic term which must be linearized before being input to the CPLEX solver. By replacing variables (b,w) with (b′,x), any quadratic terms can be eliminated, and therefore the program can be fed to the linear program solver CPLEX.
By approximating Program (16b) with a linear program, the result can be fed to a CPLEX solver to indicate whether the corresponding linear program has a feasible solution. Thus, one relaxation approximates Program (16b) with a linear program
where b′=[b′(s)]s∈S and x=[x(s)]s∈S are vectors. Program (17) relaxes Program (16b) because for any feasible solution (b,w) there exists a corresponding feasible solution (b′:=b,x:=bw). If Σs∈Sb(s)(ci,j,ksw+di,j,ks)>0 in Program (16b), then Σs∈Sb(s)wci,j,ks+b(s)di,j,ks>0 and thus, Σs∈Sx(s)ci,j,ks+b′(s)di,j,ks>0 in Program (17), for all υi∈V. Next, if wk−1≦w≦wk in Program (16b) then for all s∈S, b(s)wk−1≦b(s)w≦b(s)wk and thus b′(s)wk−1≦x(s)≦b′(s)wk in Program (17). Finally, if Σs∈Sb(s)=1 then Σs∈Sb′(s)=1. Conversely, a feasible solution (b′,x) may not imply a corresponding feasible solution (b,w). That is, while Σs∈Sx(s)ci,j,ks+b′(s)di,j,ks>0 in Program (17) implies that Σs∈Sb′(s)([x(s)/b′(s)]ci,j,ks+di,j,ks)>0, all the ratios [x(s)/b′(s)], s∈S would need to be equal to some unique wk−1≦w≦wk for Σs∈Sb′(s)(ci,j,ksw+di,j,ks)>0 to hold.
Because Program (17) relaxes Program (16b), its decision to not prune υj from V—a result of finding a feasible solution (b′,x)—in one embodiment, may be too conservative. However, the smaller the wealth interval [wk−1,wk], the more accurate Program (17) becomes, that is, the greater the chance that a feasible solution (b′,x) implies a feasible solution (b,w). Thus, for a given feasible solution (b′,x), let (b:=b′,w:=wk−1) be a candidate solution to Program (16b). Clearly Σs∈Sb(s)=1 and wk−1≦w≦wk. In addition, for all υi∈V it holds for Cimax:=maxs∈S|ci,j,ks| that
and thus, limw
In one embodiment, to speed up the algorithm, the constraint Σs∈Sx(s)ci,j,ks+b′(s)di,j,ks>0 of Program (17) is tightened by some ε>0. Specifically, it is less likely to find a feasible solution to Program
than to Program (17) and thus, more likely to prune more functions from V, which speeds up the algorithm. However, Program (18) may classify some of the non-dominated functions as dominated ones and hence, the pruning procedure will no longer be error-free. The total error of the algorithm, however, is bounded. In one embodiment, it can be trivially bounded by ε·3·N, where a tunable parameter ε of Program (18) is the error of the pruning procedure, 3 is the number of stages (of the proof by induction) that call the pruning procedure and N is the planning horizon.
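The feasibility checks of Programs (17) and (18) can be sketched with an off-the-shelf LP solver standing in for CPLEX (scipy's linprog here); the coefficient layout and function names are illustrative assumptions. The key move is the linearization described above: the products b(s)·w are replaced by fresh variables x(s):

```python
from scipy.optimize import linprog

def is_dominated(j, funcs, w_lo, w_hi, eps=1e-6):
    """Sketch of the Program (18)-style pruning test on one wealth
    interval [w_lo, w_hi]. Each function i is a pair (c, d) with
    v_i(b,w) = sum_s b(s)*(c[s]*w + d[s]). Variables are ordered
    [b(0..S-1), x(0..S-1)] with x(s) standing in for b(s)*w."""
    S = len(funcs[0][0])
    n = 2 * S
    A_ub, b_ub = [], []
    cj, dj = funcs[j]
    for i, (ci, di) in enumerate(funcs):
        if i == j:
            continue
        # require sum_s x(s)*(cj-ci)[s] + b(s)*(dj-di)[s] >= eps
        A_ub.append([-(dj[s] - di[s]) for s in range(S)]
                    + [-(cj[s] - ci[s]) for s in range(S)])
        b_ub.append(-eps)
    for s in range(S):
        # link x(s) to b(s): b(s)*w_lo <= x(s) <= b(s)*w_hi
        row = [0.0] * n; row[s], row[S + s] = w_lo, -1.0
        A_ub.append(row); b_ub.append(0.0)
        row = [0.0] * n; row[s], row[S + s] = -w_hi, 1.0
        A_ub.append(row); b_ub.append(0.0)
    A_eq = [[1.0] * S + [0.0] * S]          # beliefs sum to 1
    bounds = [(0, 1)] * S + [(None, None)] * S
    res = linprog([0.0] * n, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq,
                  b_eq=[1.0], bounds=bounds, method="highs")
    return res.status != 0  # infeasible => v_j dominated on this interval
```

An empty objective vector makes this a pure feasibility check, matching the “max ∅” programs above; infeasibility of the relaxed program certifies that υj may be pruned.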
Thus, the algorithm described by Programs 16), 17), 18) is sped up as the solver finds the value functions Vn(b,w) (for the decision epochs n=0, 1, . . . , N), where each value function is represented by a number of bilinear functions. Some of these bilinear functions might be redundant because they are completely dominated by other bilinear functions and hence will never be used by the agent when deciding what action to execute. These completely dominated bilinear functions are pruned while the underlying value functions are still represented exactly, but with a reduced number of bilinear functions. This reduces computation time, because the number of bilinear functions needed (e.g., in a worst case) to represent the value function grows exponentially with n.
This methodology scales to larger extensions. For example, a bigger domain is considered, including 100 different states of the market (e.g., markets of different countries) and 5 different actions to invest in markets of different countries. With respect to the algorithm, different values (0.5, 1, 1.5, 2, 2.5) of the approximation parameter ε (used in Program (18)) were tested. Also, the planning horizon was fixed at N=10 and the algorithm was run for each utility function (A), (B), (C), (D), (E) as shown in the plot of utility functions 300 shown in
Thus, by employing Risk-Sensitive POMDPs, an extension of POMDPs, in risk domains such as financial planning, the agents are able to maximize the expected utility of their actions. The exact algorithm solves Risk-Sensitive POMDPs, for piecewise linear utility functions by representing the underlying value functions with sets of piecewise bilinear functions—computed exactly using functional value iteration—and pruning the dominated bilinear functions using efficient linear programming approximations of the underlying non-convex bilinear programs.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a system, apparatus, or device running an instruction.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a system, apparatus, or device running an instruction. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may run entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which run via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which run on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more operable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be run substantially concurrently, or the blocks may sometimes be run in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Assume n=N. Let Y^N_0 := B×W^N, I(N) := {0} and let π̇^N_0 be an arbitrary policy. Because the process terminates at decision epoch N, it holds for all (b,w) ∈ Y^N_0 that (from Equations (2) and (5)) V^N_U(b,w) = U(w) = E[U(w)] = E[U(w + Σ_{n′=N}^{N−1} r_{n′}) | π̇^N_0, b_0 = b] = υ^N_{π̇_0}(b,w) = max_{i∈I(N)} υ^N_{π̇_i}(b,w) (the sum of remaining rewards is empty), which proves claim 1. Furthermore, to prove that υ^N_{π̇_0} is piecewise bilinear, let I(N,0) := {1, . . . , K} and W^N_{0,k} := [w_k, w_{k+1}), k ∈ I(N,0). Clearly, {B×W^N_{0,k}}_{k∈I(N,0)} is a finite partitioning of B×W^N and each set W^N_{0,k}, k ∈ I(N,0), is convex. In addition, υ^N_{π̇_0}(b,w) = Σ_s b(s)(C_k w + D_k) = C_k w + D_k for all (b,w) ∈ B×W^N_{0,k}, k ∈ I(N,0); hence, υ^N_{π̇_0}(b,w) is linear, and thus also piecewise bilinear, over (b,w) ∈ B×W^N, which proves claim 2. Finally, claim 3 holds because we constructed υ^N_{π̇_0} without even considering the set of functions {υ^{N+1}_{π̇_{i′}}}_{i′∈I(N+1)} and our choice of π̇^N_0 was arbitrary. The induction claim thus holds for n=N.
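By way of illustration only, the base case can be sketched in code: at epoch N the value function is just the piecewise linear utility, V^N_U(b,w) = U(w) = Σ_s b(s)(C_k w + D_k). All identifiers below are ours, not part of the claimed method.

```python
# Illustrative sketch of the base-case value function v_N(b, w) = U(w),
# where U is piecewise linear with breakpoints w_1, ..., w_{K+1} and
# slope/intercept C_k, D_k on each piece [w_k, w_{k+1}).
from bisect import bisect_right

def make_base_value_function(breakpoints, C, D):
    """Return v(b, w) = sum_s b(s) * (C_k * w + D_k) for the piece k
    containing w; since b sums to 1, this collapses to C_k * w + D_k."""
    def v(b, w):
        # locate the utility piece k with w in [w_k, w_{k+1})
        k = min(max(bisect_right(breakpoints, w) - 1, 0), len(C) - 1)
        return sum(bs * (C[k] * w + D[k]) for bs in b)
    return v

# a risk-averse utility: slope 2 on [0, 50), slope 1 on [50, 100)
v_N = make_base_value_function([0.0, 50.0, 100.0], C=[2.0, 1.0], D=[0.0, 50.0])
```

The decreasing slopes model diminishing marginal utility of wealth, i.e., risk aversion.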
Assume now that the induction claim holds for n+1. Our goal is to prove that it also holds for n. To this end, recall from Equation (3) that V^n_U(b,w) is calculated by
V^n_U(b,w) = max_{a∈A} Σ_{z∈Z} P(z|b,a) V^{n+1}_U(T(b,a,z), w + R(b,a)).
We break this calculation into five stages. First, we calculate V^n_{U,a,z}(b,w) := V^{n+1}_U(T(b,a,z),w), where V^{n+1}_U is represented by {υ^{n+1}_{π̇_i}}_{i∈I(n+1)} from the induction assumption. Next, we derive V̄^n_{U,a,z}(b,w) := P(z|b,a) V^n_{U,a,z}(b,w), then V^n_{U,a}(b,w) := Σ_{z∈Z} V̄^n_{U,a,z}(b,w), and then V̂^n_{U,a}(b,w) := V^n_{U,a}(b, w + R(b,a)). Finally, we calculate V^n_U(b,w) := max_{a∈A} V̂^n_{U,a}(b,w).
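The five stages above can also be sketched as a direct numeric backup (a crude illustration of our own, not the patent's tabular representation; the model arrays and identifiers are assumptions):

```python
# Direct numeric sketch of the backup in Equation (3), five stages inline.
# P[a][s][s'] = P(s'|s,a); O[a][s'][z] = O(z|a,s'); R[a][s] = R(s,a);
# V_next stands in for V_U^{n+1}.
def backup(b, w, V_next, P, O, R, actions, observations):
    """V^n(b,w) = max_a sum_z P(z|b,a) * V^{n+1}(T(b,a,z), w + R(b,a))."""
    n_states = len(b)
    best = float("-inf")
    for a in actions:
        reward = sum(b[s] * R[a][s] for s in range(n_states))   # R(b,a)
        total = 0.0
        for z in observations:
            # stage 1: belief update T(b,a,z) with normalizer P(z|b,a)
            unnorm = [sum(b[s] * P[a][s][s2] for s in range(n_states)) * O[a][s2][z]
                      for s2 in range(n_states)]
            pz = sum(unnorm)
            if pz == 0.0:
                continue
            nb = [u / pz for u in unnorm]
            # stages 2-4: weight by P(z|b,a), sum over z, shift wealth by R(b,a)
            total += pz * V_next(nb, w + reward)
        best = max(best, total)                                  # stage 5: max over a
    return best
```

This evaluates the backup at a single point (b,w); the five stages of the description build a closed-form representation valid for all (b,w) instead.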
Stage 1: Calculate V^n_{U,a,z}(b,w) := V^{n+1}_U(T(b,a,z),w).
From the induction assumption, V^{n+1}_U is represented by a finite set of functions {υ^{n+1}_{π̇_i}}_{i∈I(n+1)}, corresponding to point-based policies π̇_i, i ∈ I(n+1), and each υ^{n+1}_{π̇_i} is piecewise bilinear. We now prove that V^n_{U,a,z}(b,w) := V^{n+1}_U(T(b,a,z),w) can be represented by a finite set of functions V^n_{a,z}(b,w) := {υ^n_{a,z,i}}_{i∈I(n+1)} derived from the collection of functions {υ^{n+1}_{π̇_i}}_{i∈I(n+1)}, and that each function υ^n_{a,z,i} is piecewise bilinear. To this end, define a finite partitioning {Y^n_{a,z,i}}_{i∈I(n+1)} of B×W^{n+1} where
Y^n_{a,z,i} := {(b,w) ∈ B×W^{n+1} | υ^{n+1}_{π̇_i}(T(b,a,z),w) ≥ υ^{n+1}_{π̇_{i′}}(T(b,a,z),w) for all i′ ∈ I(n+1)}   (6)
(with ties broken in favor of the smallest such index i, so that the sets are disjoint)
and a finite set of functions V^n_{a,z} = {υ^n_{a,z,i}}_{i∈I(n+1)} where
υ^n_{a,z,i}(b,w) := υ^{n+1}_{π̇_i}(T(b,a,z),w)   (7)
for all (b,w) ∈ B×W^{n+1}. It is then true that for all (b,w) ∈ B×W^{n+1} there exists i ∈ I(n+1) such that (b,w) ∈ Y^n_{a,z,i} and υ^n_{a,z,i}(b,w) := υ^{n+1}_{π̇_i}(T(b,a,z),w) = max_{i′∈I(n+1)} υ^{n+1}_{π̇_{i′}}(T(b,a,z),w) = V^{n+1}_U(T(b,a,z),w) = V^n_{U,a,z}(b,w). Thus, V^n_{U,a,z}(b,w) can be represented by a finite set of functions V^n_{a,z} = {υ^n_{a,z,i}}_{i∈I(n+1)} derived from {υ^{n+1}_{π̇_i}}_{i∈I(n+1)}. In addition, each υ^n_{a,z,i} is piecewise bilinear, as proven by Lemma 1 in the Appendix.
Finally, notice that if a function υ^n_{a,z,i} ∈ V^n_{a,z} is dominated by other functions υ^n_{a,z,i′} ∈ V^n_{a,z}, i.e., if for any (b,w) ∈ B×W^{n+1} there exists i′ ∈ I(n+1), i′ ≠ i, such that υ^n_{a,z,i}(b,w) < υ^n_{a,z,i′}(b,w), then (from definition (6)) Y^n_{a,z,i} = Ø. In such a case (to speed up the algorithm) υ^n_{a,z,i} can be pruned from V^n_{a,z} and Y^n_{a,z,i} can be removed from {Y^n_{a,z,i}}_{i∈I(n+1)}, as that will not affect the representation of V^n_{U,a,z}. (How to determine whether a function υ^n_{a,z,i} is dominated is explained later.) The value functions V^n_{U,a,z}(b,w) can thus be represented by finite sets of piecewise bilinear functions V^n_{a,z} = {υ^n_{a,z,i}}_{i∈I(n,a,z)} where I(n,a,z) ⊂ I(n+1).
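The exact dominance test is deferred in the description; as a crude stand-in of our own (a sufficient check only, not the patent's procedure), dominance can be approximated by sampling (b,w) points:

```python
# Sampling-based pruning sketch: a candidate function is dropped when a
# single other function beats it at every sampled (b, w) point. This is a
# conservative approximation of our own, not an exact dominance test.
from itertools import product

def prune_dominated(funcs, belief_samples, wealth_samples):
    """funcs: list of callables (b, w) -> float. Returns the kept functions."""
    pts = list(product(belief_samples, wealth_samples))
    vals = [[f(b, w) for (b, w) in pts] for f in funcs]
    keep = []
    for i, f in enumerate(funcs):
        dominated = any(
            j != i and all(vj > vi for vj, vi in zip(vals[j], vals[i]))
            for j in range(len(funcs))
        )
        if not dominated:
            keep.append(f)
    return keep
```

An exact test would have to certify dominance over the whole continuous belief-wealth space rather than a finite sample.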
Stage 2: Calculate V̄^n_{U,a,z}(b,w) := P(z|b,a) V^n_{U,a,z}(b,w).
Consider the value functions V^n_{U,a,z}(b,w) represented after stage 1 by finite sets of piecewise bilinear functions V^n_{a,z} = {υ^n_{a,z,i}}_{i∈I(n,a,z)}. We now demonstrate that the value function V̄^n_{U,a,z}(b,w) := P(z|b,a) V^n_{U,a,z}(b,w) can be represented by a finite set of functions V̄^n_{a,z} = {ῡ^n_{a,z,i}}_{i∈I(n,a,z)} where
ῡ^n_{a,z,i}(b,w) := P(z|b,a) υ^n_{a,z,i}(b,w)   (8)
for all (b,w) ∈ B×W^{n+1}. Indeed, since {Y^n_{a,z,i}}_{i∈I(n,a,z)} is a partitioning of B×W^{n+1} (from definition (6)), it holds for all (b,w) ∈ B×W^{n+1} that there exists i ∈ I(n,a,z) such that (b,w) ∈ Y^n_{a,z,i} and ῡ^n_{a,z,i}(b,w) = P(z|b,a) υ^n_{a,z,i}(b,w) = P(z|b,a) V^n_{U,a,z}(b,w) = V̄^n_{U,a,z}(b,w). Moreover, each ῡ^n_{a,z,i} is piecewise bilinear, i.e.,
ῡ^n_{a,z,i}(b,w) = Σ_{s∈S} b(s)(c̄^{n,k,s}_{a,z,i} w + d̄^{n,k,s}_{a,z,i})   (9)
for all (b,w) ∈ B×W^{n+1}_{i,k}, k ∈ I(n+1,i), where c̄^{n,k,s}_{a,z,i} and d̄^{n,k,s}_{a,z,i} are constants derived from the bilinear representation of υ^n_{a,z,i} established in Lemma 1.
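Stage 2 amounts to a pointwise scaling of each candidate function by the observation likelihood. A minimal sketch (identifiers are ours):

```python
# Stage-2 sketch: scale every candidate function by P(z|b,a).
def weight_by_observation(v_set, pz):
    """v_set: list of callables (b, w) -> float; pz: callable b -> P(z|b,a).
    Returns the set {vbar_i(b, w) = P(z|b,a) * v_i(b, w)}."""
    return [lambda b, w, f=f: pz(b) * f(b, w) for f in v_set]
```

The `f=f` default argument pins each function in its own closure, a standard idiom when building lambdas in a loop.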
Stage 3: Calculate V^n_{U,a}(b,w) := Σ_{z∈Z} V̄^n_{U,a,z}(b,w).
Consider the value functions V̄^n_{U,a,z}(b,w), z ∈ Z, represented after stage 2 by finite sets of piecewise bilinear functions V̄^n_{a,z} = {ῡ^n_{a,z,i}}_{i∈I(n,a,z)}. Let i := [i(z)]_{z∈Z} denote a vector where i(z) ∈ I(n,a,z), and let I(n,a) be the set of all such vectors i. For each vector i ∈ I(n,a) define a set
Y^n_{a,i} := ∩_{z∈Z} Y^n_{a,z,i(z)}   (10)
and a function
υ^n_{a,i}(b,w) := Σ_{z∈Z} ῡ^n_{a,z,i(z)}(b,w)   (11)
for all (b,w) ∈ B×W^{n+1}. To show that V^n_{U,a} can be represented with a set of functions V^n_a = {υ^n_{a,i}}_{i∈I(n,a)} we first prove that {Y^n_{a,i}}_{i∈I(n,a)} is a finite partitioning of B×W^{n+1}. To this end, first observe that Y^n_{a,i} ∩ Y^n_{a,i′} = Ø for all i, i′ ∈ I(n,a), i ≠ i′. Indeed, if i ≠ i′ then i(z) ≠ i′(z) for some z ∈ Z. Thus, if (b,w) ∈ Y^n_{a,i} ∩ Y^n_{a,i′} then in particular (b,w) ∈ Y^n_{a,z,i(z)} ∩ Y^n_{a,z,i′(z)}, which is impossible because Y^n_{a,z,i(z)} ∩ Y^n_{a,z,i′(z)} = Ø for i(z) ≠ i′(z) (from definition (6)). Also, if (b,w) ∈ B×W^{n+1} then for all z ∈ Z there exists some i(z) ∈ I(n,a,z) such that (b,w) ∈ Y^n_{a,z,i(z)} (from definition (6)). Hence, for the vector i := [i(z)]_{z∈Z} ∈ I(n,a) it must hold that (b,w) ∈ ∩_{z∈Z} Y^n_{a,z,i(z)} = Y^n_{a,i}.
We then show that V^n_{U,a} can be represented with the set of functions V^n_a = {υ^n_{a,i}}_{i∈I(n,a)} as follows: since {Y^n_{a,i}}_{i∈I(n,a)} is a partitioning of B×W^{n+1}, for each (b,w) ∈ B×W^{n+1} there exists i = [i(z)]_{z∈Z} ∈ I(n,a) such that (b,w) ∈ Y^n_{a,i} and V^n_{U,a}(b,w) := Σ_{z∈Z} V̄^n_{U,a,z}(b,w) = Σ_{z∈Z} ῡ^n_{a,z,i(z)}(b,w) = υ^n_{a,i}(b,w). In addition, each υ^n_{a,i} is piecewise bilinear, as proven by Lemma 2 in the Appendix.
Finally, notice that if a function υ^n_{a,i} ∈ V^n_a is dominated by other functions υ^n_{a,i′} ∈ V^n_a then Y^n_{a,i} = Ø. Precisely, for any (b,w) ∈ B×W^{n+1}, if there exists some other function υ^n_{a,i′} ∈ V^n_a such that υ^n_{a,i}(b,w) < υ^n_{a,i′}(b,w) then (from definition (11)) ῡ^n_{a,z,i(z)}(b,w) < ῡ^n_{a,z,i′(z)}(b,w) for some z ∈ Z, so that (b,w) ∉ Y^n_{a,z,i(z)} and consequently (b,w) ∉ Y^n_{a,i}. Such dominated functions can be pruned from V^n_a, and the corresponding sets removed from {Y^n_{a,i}}_{i∈I(n,a)}, without affecting the representation of V^n_{U,a}.
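The enumeration of vectors i = [i(z)]_{z∈Z} in stage 3 is a cross-sum over observations; a minimal sketch of our own (identifiers are assumptions):

```python
# Stage-3 cross-sum sketch: pick one candidate function per observation and
# sum the picks, producing one combined candidate per vector i = [i(z)]_z.
from itertools import product

def cross_sum(sets_per_obs):
    """sets_per_obs: list (indexed by z) of lists of callables (b,w)->float.
    Returns one summed callable per choice vector i = [i(z)]_z."""
    combined = []
    for choice in product(*(range(len(s)) for s in sets_per_obs)):
        members = tuple(sets_per_obs[z][choice[z]] for z in range(len(sets_per_obs)))
        combined.append(lambda b, w, ms=members: sum(f(b, w) for f in ms))
    return combined
```

The number of combined candidates is the product of the per-observation set sizes, which is why the pruning of dominated functions matters for tractability.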
Stage 4: Calculate V̂^n_{U,a}(b,w) := V^n_{U,a}(b, w + R(b,a)).
For notational convenience in this stage (but without loss of precision), we denote the vectors i, k defined in stage 3 simply as i, k. Recall that W^n is the set of all possible wealth levels at decision epoch n, so that w + R(b,a) ∈ W^{n+1} for all (b,w) ∈ B×W^n and a ∈ A. For each i ∈ I(n,a) define a set
Ŷ^n_{a,i} := {(b,w) ∈ B×W^n | (b, w + R(b,a)) ∈ Y^n_{a,i}}   (12)
and a function
υ̂^n_{a,i}(b,w) := υ^n_{a,i}(b, w + R(b,a)).   (13)
To show that V̂^n_{U,a} can be represented with a set of functions V̂^n_a = {υ̂^n_{a,i}}_{i∈I(n,a)} we first prove that {Ŷ^n_{a,i}}_{i∈I(n,a)} is a finite partitioning of B×W^n. Indeed, since {Y^n_{a,i}}_{i∈I(n,a)} is a partitioning of B×W^{n+1}, for each (b,w) ∈ B×W^n the pair (b, w + R(b,a)) ∈ B×W^{n+1} belongs to exactly one set Y^n_{a,i}, and hence (from definition (12)) (b,w) belongs to exactly one set Ŷ^n_{a,i}.
We then show that V̂^n_{U,a} can be represented with the set of functions V̂^n_a as follows: for each (b,w) ∈ B×W^n there exists i ∈ I(n,a) such that (b,w) ∈ Ŷ^n_{a,i} and V̂^n_{U,a}(b,w) := V^n_{U,a}(b, w + R(b,a)) = υ^n_{a,i}(b, w + R(b,a)) = υ̂^n_{a,i}(b,w). In addition, each υ̂^n_{a,i} is piecewise bilinear, as proven by Lemma 3 in the Appendix.
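Evaluated pointwise, stage 4 is simply a composition with a wealth shift. A minimal sketch of our own (identifiers are assumptions):

```python
# Stage-4 sketch: shift the wealth argument by the expected immediate reward
# R(b,a) = sum_s b(s) * R_a[s] before evaluating the stage-3 function.
def wealth_shift(v_a, R_a):
    """v_a: callable (b, w) -> float; R_a[s] = R(s, a).
    Returns v_hat(b, w) = v_a(b, w + R(b, a))."""
    def v_hat(b, w):
        return v_a(b, w + sum(bs * r for bs, r in zip(b, R_a)))
    return v_hat
```

The description's contribution is doing this shift on the closed-form piecewise bilinear representation, via the shifted partitioning of definition (12), rather than pointwise as here.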
Stage 5: Calculate V^n_U(b,w) := max_{a∈A} V̂^n_{U,a}(b,w).
Consider the value functions V̂^n_{U,a}(b,w), a ∈ A, represented after stage 4 by finite sets of piecewise bilinear functions V̂^n_a = {υ̂^n_{a,i}}_{i∈I(n,a)}. Let I(n) := {(a,i) | a ∈ A, i ∈ I(n,a)}. For each (a,i) ∈ I(n) define a set
Y^n_{(a,i)} := {(b,w) ∈ B×W^n | υ̂^n_{a,i}(b,w) ≥ υ̂^n_{a′,i′}(b,w) for all (a′,i′) ∈ I(n)}
and a point-based policy π̇^n_{(a,i)} according to which the agent first executes action a ∈ A and then, depending on the observation z ∈ Z received, follows the policy π̇^{n+1}_{i(z)} given by the induction assumption.
Clearly, {Y^n_{(a,i)}}_{(a,i)∈I(n)} is a finite partitioning of B×W^n. Thus, for all (b,w) ∈ B×W^n there exists some (a,i) ∈ I(n) such that (b,w) ∈ Y^n_{(a,i)} and
υ^n_{π̇_{(a,i)}}(b,w) = υ̂^n_{a,i}(b,w) = max_{(a′,i′)∈I(n)} υ̂^n_{a′,i′}(b,w) = max_{a∈A} V̂^n_{U,a}(b,w) = V^n_U(b,w)
(the last equality follows directly from definitions (13), (11), (8) and (7)). Therefore, V^n_U can indeed be represented by a finite set of piecewise bilinear functions V^n = {υ^n_{π̇_{(a,i)}}}_{(a,i)∈I(n)} = {υ̂^n_{a,i}}_{(a,i)∈I(n)}, which completes the induction step.
Finally, notice that if a function υ^n_{π̇_{(a,i)}} ∈ V^n is dominated by other functions υ^n_{π̇_{(a′,i′)}} ∈ V^n, i.e., if for all (b,w) ∈ B×W^n there exists some υ^n_{π̇_{(a′,i′)}} ∈ V^n such that υ^n_{π̇_{(a,i)}}(b,w) < υ^n_{π̇_{(a′,i′)}}(b,w), then Y^n_{(a,i)} = Ø. In such a case (to speed up the algorithm) υ^n_{π̇_{(a,i)}} can be pruned from V^n and Y^n_{(a,i)} can be removed from {Y^n_{(a,i)}}_{(a,i)∈I(n)}, as that will not affect the representation of V^n_U.
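At a given (b,w), stage 5 also yields the first action of the optimal point-based policy. A minimal sketch of our own (identifiers are assumptions):

```python
# Stage-5 sketch: the value is the pointwise maximum over (action, candidate)
# pairs, and the maximizing action is the first step of the extracted policy.
def max_and_policy(candidates, b, w):
    """candidates: list of (action, callable (b,w)->float).
    Returns (best value, maximizing action)."""
    best_val, best_act = float("-inf"), None
    for action, func in candidates:
        val = func(b, w)
        if val > best_val:
            best_val, best_act = val, action
    return best_val, best_act
```

Repeating this lookup at each epoch, after each belief update and wealth change, executes the multistage strategy.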
Lemma 1: Function υ^n_{a,z,i}(b,w) := υ^{n+1}_{π̇_i}(T(b,a,z),w) is piecewise bilinear over (b,w) ∈ B×W^{n+1}.
From the induction assumption, υ^{n+1}_{π̇_i}(b,w) is piecewise bilinear over (b,w) ∈ B×W^{n+1}, i.e., there exists a finite partitioning {B×W^{n+1}_{i,k}}_{k∈I(n+1,i)} of B×W^{n+1} such that each W^{n+1}_{i,k} is a convex set and υ^{n+1}_{π̇_i}(b,w) = Σ_{s∈S} b(s)(c^{n+1}_{i,k,s} w + d^{n+1}_{i,k,s}) for all (b,w) ∈ B×W^{n+1}_{i,k}, k ∈ I(n+1,i). We now prove that υ^n_{a,z,i}(b,w) := υ^{n+1}_{π̇_i}(T(b,a,z),w) too is piecewise bilinear over (b,w) ∈ B×W^{n+1} for the partitioning {B×W^{n+1}_{i,k}}_{k∈I(n+1,i)} of B×W^{n+1}. To this end, for each s ∈ S distinguish a belief state b_s ∈ B such that b_s(s) = 1. It then holds for all (b,w) ∈ B×W^{n+1}_{i,k}, k ∈ I(n+1,i), that
υ^n_{a,z,i}(b,w) = Σ_{s∈S} b(s)(c^{n,k,s}_{a,z,i} w + d^{n,k,s}_{a,z,i})
for constants c^{n,k,s}_{a,z,i} := Σ_{s′∈S} P(s′|s,a) O(z|a,s′) c^{n+1}_{i,k,s′} and d^{n,k,s}_{a,z,i} := Σ_{s′∈S} P(s′|s,a) O(z|a,s′) d^{n+1}_{i,k,s′}. Consequently, function υ^n_{a,z,i}(b,w) is piecewise bilinear over (b,w) ∈ B×W^{n+1}, which proves the Lemma.
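The Lemma-1 constants can be written out as plain loops; a sketch of our own (array layout is an assumption):

```python
# Lemma-1 constants: c_new[s] = sum_{s'} P(s'|s,a) * O(z|a,s') * c_next[s'],
# and likewise for d. P_a[s][s2] = P(s2|s,a); O_az[s2] = O(z|a,s2).
def lemma1_constants(P_a, O_az, c_next, d_next):
    n = len(P_a)
    c_new = [sum(P_a[s][s2] * O_az[s2] * c_next[s2] for s2 in range(n))
             for s in range(n)]
    d_new = [sum(P_a[s][s2] * O_az[s2] * d_next[s2] for s2 in range(n))
             for s in range(n)]
    return c_new, d_new
```

Each new constant is an expectation of the epoch-(n+1) constants over the transition and observation models, which is what preserves bilinearity in b.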
Lemma 2: Function υ^n_{a,i}(b,w) := Σ_{z∈Z} ῡ^n_{a,z,i(z)}(b,w) is piecewise bilinear over (b,w) ∈ B×W^{n+1}.
After stage 2 it holds for all z ∈ Z that ῡ^n_{a,z,i(z)}(b,w) is piecewise bilinear over (b,w) ∈ B×W^{n+1}, i.e., there exists a partitioning {B×W^{n+1}_{i(z),k}}_{k∈I(n+1,i(z))} of B×W^{n+1} such that ῡ^n_{a,z,i(z)}(b,w) = Σ_{s∈S} b(s)(c̄^{n,k,s}_{a,z,i(z)} w + d̄^{n,k,s}_{a,z,i(z)}) for all (b,w) ∈ B×W^{n+1}_{i(z),k}. Now, let k := [k(z)]_{z∈Z} denote a vector where k(z) ∈ I(n+1,i(z)), and let I(n,a,i) be the set of all such vectors k. For each vector k ∈ I(n,a,i) define a set
W^{n+1}_{a,i,k} := ∩_{z∈Z} W^{n+1}_{i(z),k(z)}   (20)
and a bilinear function
υ^n_{a,i,k}(b,w) := Σ_{s∈S} b(s)(c^{n,k,s}_{a,i} w + d^{n,k,s}_{a,i})   (21)
for all (b,w) ∈ B×W^{n+1}, with constants c^{n,k,s}_{a,i} := Σ_{z∈Z} c̄^{n,k(z),s}_{a,z,i(z)} and d^{n,k,s}_{a,i} := Σ_{z∈Z} d̄^{n,k(z),s}_{a,z,i(z)}.
We can therefore prove that the functions {υ^n_{a,i,k}}_{k∈I(n,a,i)} represent υ^n_{a,i}(b,w) over all (b,w) ∈ B×W^{n+1} as follows: For each (b,w) ∈ B×W^{n+1} there exists k ∈ I(n,a,i) such that (b,w) ∈ B×W^{n+1}_{a,i,k}. Hence, (from definition (20)) (b,w) ∈ B×W^{n+1}_{i(z),k(z)} for all z ∈ Z and thus, (from definition (9)),
υ^n_{a,i}(b,w) = Σ_{z∈Z} ῡ^n_{a,z,i(z)}(b,w) = Σ_{z∈Z} Σ_{s∈S} b(s)(c̄^{n,k(z),s}_{a,z,i(z)} w + d̄^{n,k(z),s}_{a,z,i(z)}) = Σ_{s∈S} b(s)(c^{n,k,s}_{a,i} w + d^{n,k,s}_{a,i}) = υ^n_{a,i,k}(b,w).
Moreover, each set W^{n+1}_{a,i,k} is convex as an intersection of convex sets. Consequently, υ^n_{a,i}(b,w) is piecewise bilinear over (b,w) ∈ B×W^{n+1}, which proves the Lemma.
Lemma 3: Function υ̂^n_{a,i}(b,w) := υ^n_{a,i}(b, w + R(b,a)) is piecewise bilinear over (b,w) ∈ B×W^n.
After stage 3 it is true for all i ∈ I(n,a) that υ^n_{a,i}(b,w) is piecewise bilinear over (b,w) ∈ B×W^{n+1}, i.e., there exists a partitioning {B×W^{n+1}_{a,i,k}}_{k∈I(n,a,i)} of B×W^{n+1} such that each W^{n+1}_{a,i,k} is convex and υ^n_{a,i}(b,w) = υ^n_{a,i,k}(b,w) = Σ_{s∈S} b(s)(c^{n,k,s}_{a,i} w + d^{n,k,s}_{a,i}) for all (b,w) ∈ B×W^{n+1}_{a,i,k}, for all k ∈ I(n,a,i). To prove that υ̂^n_{a,i} too is piecewise bilinear over (b,w) ∈ B×W^n, for each s ∈ S and k ∈ I(n,a,i) define a set
W^{n,s}_{a,i,k} := {w ∈ W^n | w + R(s,a) ∈ W^{n+1}_{a,i,k}}.   (22)
Now, let k := [k(s)]_{s∈S} denote a vector where k(s) ∈ I(n,a,i), and let Ī(n,a,i) be the set of all such vectors k. For each vector k ∈ Ī(n,a,i) then define a set
W̄^n_{a,i,k} := ∩_{s∈S} W^{n,s}_{a,i,k(s)}   (23)
and a bilinear function
υ̂^n_{a,i,k}(b,w) := Σ_{s∈S} b(s)(ĉ^{n,k,s}_{a,i} w + d̂^{n,k,s}_{a,i})   (24)
for all (b,w) ∈ B×W^n, where ĉ^{n,k,s}_{a,i} := c^{n,k(s),s}_{a,i} and d̂^{n,k,s}_{a,i} := c^{n,k(s),s}_{a,i} R(s,a) + d^{n,k(s),s}_{a,i}.
We then show that the functions {υ̂^n_{a,i,k}}_{k∈Ī(n,a,i)} represent υ̂^n_{a,i}(b,w) over all (b,w) ∈ B×W^n: for each (b,w) ∈ B×W^n there exists k ∈ Ī(n,a,i) such that w ∈ W̄^n_{a,i,k}, and then υ̂^n_{a,i}(b,w) = υ^n_{a,i}(b, w + R(b,a)) = Σ_{s∈S} b(s)(c^{n,k(s),s}_{a,i}(w + R(s,a)) + d^{n,k(s),s}_{a,i}) = υ̂^n_{a,i,k}(b,w).
Finally, each set W̄^n_{a,i,k} is convex, because it is an intersection of the sets W^{n,s}_{a,i,k(s)}, each of which is convex as a preimage of the convex set W^{n+1}_{a,i,k(s)} under the shift w ↦ w + R(s,a). Consequently, υ̂^n_{a,i}(b,w) is piecewise bilinear over (b,w) ∈ B×W^n, which proves the Lemma.
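The convexity argument is concrete for intervals of wealth: each W̄ set is an intersection of shifted half-open intervals, which is itself an interval or empty. A minimal sketch of our own:

```python
# Intersection of half-open wealth intervals [lo, hi): the result is again
# a half-open interval (hence convex), or None when the intersection is empty.
def intersect_intervals(intervals):
    """intervals: list of (lo, hi) pairs. Returns (lo, hi) or None."""
    lo = max(iv[0] for iv in intervals)
    hi = min(iv[1] for iv in intervals)
    return (lo, hi) if lo < hi else None
```

Empty intersections correspond to vectors k whose combined region W̄^n_{a,i,k} can be discarded outright.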
The U.S. Government has a paid-up license in this invention and the right in limited circumstances to require the patent owner to license others on reasonable terms as provided for by the terms of Contract No. W911NF-06-3-0001 awarded by the United States Army.