BELIEF SPACE PLANNING THROUGH INCREMENTAL EXPECTATION UPDATING

Information

  • Patent Application
  • Publication Number: 20200327358
  • Date Filed: April 10, 2020
  • Date Published: October 15, 2020
Abstract
A system and methods are provided for decision making under uncertainty, for selecting an optimal action from among multiple candidate actions in belief space planning (BSP), including, for a new propagated belief: selecting from stored propagated beliefs a closest propagated belief; selecting, from among the stored measurement samples associated with the closest propagated belief, re-use measurement samples that provide a representative set of a measurement likelihood distribution corresponding to the new propagated belief; determining an information gap between a stored posterior belief associated with a re-use measurement sample and a new posterior belief that would be inferred by applying the re-use measurement sample to the new propagated belief; updating the stored posterior belief to account for the information gap; and calculating objective values for the multiple candidate actions, the objective values being weighted summations including immediate scores of the updated posterior beliefs.
Description
FIELD OF THE INVENTION

The present invention relates to the field of artificial intelligence.


BACKGROUND OF THE INVENTION

Computationally efficient approaches for decision making under uncertainty are useful for a range of artificial intelligence applications, including guidance of autonomous systems. Decision making typically includes determination of optimal actions over time. A common paradigm for such decision making is Belief Space Planning (BSP), one form of which is known as the partially observable Markov decision process (POMDP) problem, in which different candidate actions are considered together with predicted future measurements to determine an optimal action. The POMDP problem is known to be computationally intractable for all but the smallest problems, i.e. those with no more than a few dozen states.


The main cause for the BSP problem intractability lies in the calculation of expectation in the objective function:










$$J(u) = \mathbb{E}_z\left[\sum_i c_i\big(b_i,\,u_{i-1}\big)\right]. \tag{1}$$







The objective over a candidate action sequence u is obtained by calculating the expected value of all possible rewards (or costs) r received from following u. Since the reward (or cost) function is a function of the belief b and the action u that led to it, in practice the objective considers all future beliefs obtained from following u, i.e. the expectation considers the joint measurement likelihood of all future measurements z. This general problem is referred to as the full solution of BSP, denoted X-BSP, for expectation-based BSP.


Performing inference over multiple future beliefs is one of the main reasons for the costly computation time of X-BSP. In a planning session with a horizon of 3 steps ahead, 3 candidate actions per step and 3 measurement samples per action, there are 9 + 81 + 729 = 819 beliefs to be solved. Cutting down the computation time of each belief would benefit the overall computation time of the planning process.
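For illustration, this belief count follows directly from the branching factor of the look-ahead tree; a minimal sketch, assuming only that each step multiplies the tree width by (actions × samples):

```python
# A minimal sketch: counting the beliefs solved by X-BSP over a planning
# horizon. With n_actions candidate actions per step and n_samples sampled
# measurements per action, each look-ahead step multiplies the tree width
# by n_actions * n_samples.
def num_beliefs(horizon: int, n_actions: int, n_samples: int) -> int:
    branching = n_actions * n_samples
    return sum(branching ** i for i in range(1, horizon + 1))

assert num_beliefs(horizon=3, n_actions=3, n_samples=3) == 819  # 9 + 81 + 729
```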


As in any computational problem, one can either streamline the solution process or change the problem, i.e. take simplifying assumptions or approximations. Over the years, numerous approaches have been developed to trade off sub-optimal performance against reduced computational complexity of POMDP. While the majority of these approaches assume that some sources of absolute information (GPS, known landmarks) are available, or consider the environment to be known, recent approaches have relaxed these assumptions, accounting for the uncertainties in the environment mapped thus far as part of the decision making process. An example of such an approach is described in V. Indelman, et al., "Planning in the continuous domain: a generalized belief space approach for autonomous navigation in unknown environments," Intl. J. of Robotics Research, 34 (7):849-882, 2015, incorporated herein by reference.


Other than assuming available sources of absolute information, some of these approaches use discretization in order to reduce computational complexity. Sampling-based approaches discretize the state space using randomized exploration strategies to locate the belief's optimal strategy. Other approaches discretize the action space, thus trading optimality for reduced computational load. While many sampling-based approaches, such as probabilistic road maps (PRMs), rapidly exploring random trees (RRTs), and rapidly exploring random graphs (RRGs), assume perfect knowledge of the state, along with deterministic control and a known environment, efforts have been made to remove these simplifying assumptions. These efforts vary in the assumptions they alleviate, from the belief road map (BRM) and rapidly exploring random belief trees (RRBTs), to partially observable Monte-Carlo planning (POMCP), determinized sparse partially observable trees (DESPOTs), and full simultaneous localization and mapping (SLAM) in discrete and continuous domains, accounting for uncertainties in the environment mapped thus far as part of the decision making process.


In contrast to the large amount of work on approximating the X-BSP problem, only a few approaches include re-use of prior calculations. Although under simplifying assumptions, including ML, these approaches have described re-using computationally expensive calculations during planning. These include: Kopitkov and Indelman, "No belief propagation required: Belief space planning in high-dimensional state spaces via factor graphs, matrix determinant lemma and re-use of calculation," Intl. J. of Robotics Research, 36 (10):1088-1130, August 2017; and Chaves and Eustice, "Efficient planning with the Bayes tree for active SLAM," in Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference, pp. 4664-4671, IEEE, 2016, both of which are incorporated herein by reference. Chaves and Eustice consider a Gaussian belief under ML-BSP in a Bayes tree representation. All candidate action sequences consider a shared location (entrance pose), thus enabling the re-use of calculations with state ordering constraints. Kopitkov and Indelman also consider a Gaussian belief under ML-BSP, utilizing a factor graph representation of the belief while considering an information-theoretic cost. Like Chaves and Eustice, they consider calculation re-use within the same planning session.


A further example of re-use, based on the simplifying assumption that all samples are taken from a single distribution and can therefore be re-used, is described by the inventors of the present invention in the following references, which are incorporated herein by reference: Farhi and Indelman, "Towards efficient inference update through planning via JIP—joint inference and belief space planning," IEEE Intl. Conf. on Robotics and Automation (ICRA), 2017; and PCT application WO2019/171378, entitled "Efficient Inference Update using Belief Space Planning."


Re-use of prior belief calculations, without the simplifying assumptions described above, could have broader application to many systems employing artificial intelligence, such as autonomous systems.


SUMMARY OF THE INVENTION

Embodiments of the present invention provide methods for incremental expectation BSP, or iX-BSP, which incrementally updates the expectation-related calculations in X-BSP by re-using measurements sampled in a prior planning session, while selectively re-sampling measurements in order to assure an adequate representation of the measurement likelihood. Instead of re-calculating each planning session from scratch, the method described below incrementally updates a previous session with newly received information. In addition, the approximation ML-BSP is expanded to formulate incremental ML-BSP, referred to herein as iML-BSP, by enforcing the ML assumption over iX-BSP, as is done over X-BSP. Given access to calculations from precursory planning, at each look-ahead step i in the current planning session, iML-BSP considers the appropriate sample from the i-th look-ahead step in the precursory planning session for re-use. If the sample constitutes an adequate representation of the measurement likelihood that would have been considered at the i-th look-ahead step in the current planning session, then iML-BSP utilizes the associated previously solved belief from the precursory planning session. If the sample is considered an inadequate representation of the measurement likelihood, iML-BSP may follow the process of ML-BSP, whereby the most likely measurement of the nominal measurement likelihood is considered instead.


The approach described herein includes: calculating the expectation over future observations incrementally, using a set of samples comprising newly sampled measurements and re-used measurements sampled in prior planning sessions; and selectively re-sampling based on the method of multiple importance sampling, while considering a balance heuristic.
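For illustration, a minimal sketch of the balance heuristic weighting used in multiple importance sampling is given below; the 1-D Gaussian proposals and sample counts are illustrative assumptions, not the specific measurement likelihoods of the method:

```python
import math

# A minimal sketch of the balance heuristic in multiple importance sampling
# (MIS). A sample x drawn from proposal i, among proposals p_0..p_{M-1} with
# n_j samples each, receives weight n_i p_i(x) / sum_j n_j p_j(x).
def gaussian_pdf(mu, sigma):
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def balance_heuristic_weight(x, i, pdfs, counts):
    num = counts[i] * pdfs[i](x)
    den = sum(n * p(x) for n, p in zip(counts, pdfs))
    return num / den

# e.g., a re-used proposal (planning time k) and a current one (time k+l):
p_old, p_new = gaussian_pdf(0.0, 1.0), gaussian_pdf(0.5, 1.2)
w = balance_heuristic_weight(0.3, 0, [p_old, p_new], [10, 5])
```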


There is therefore provided, by embodiments of the present invention, a system and methods for decision making under uncertainty, including selecting an optimal action from among multiple candidate actions, including the implementation of steps that include: inferring, from a current belief, multiple new propagated beliefs according to respective multiple candidate actions; accessing a stored set of propagated beliefs, each given stored propagated belief being associated with one or more stored measurement samples and with one or more respective stored posterior beliefs inferred from the given stored propagated belief, according to a respective measurement sample, wherein the stored propagated beliefs were propagated from prior beliefs during one or more precursory planning sessions at one or more respective previous times.


The steps further include, for each new propagated belief generated for a respective candidate action: selecting from the set of stored propagated beliefs a closest stored propagated belief; and selecting, from among the one or more stored measurement samples associated with the closest stored propagated belief, re-use measurement samples for a representative set of a measurement likelihood distribution corresponding to the new propagated belief.


The steps further include, for each re-use measurement sample: determining an information gap between the stored posterior belief associated with the re-use measurement sample and a new posterior belief that would be inferred by applying the re-use measurement sample to the new propagated belief; responsively updating the stored posterior belief to account for the information gap and associating the new propagated belief with the updated posterior belief; and calculating an immediate score for the updated posterior belief associated with the new propagated belief.


The steps further include, subsequently, calculating objective values for the multiple candidate actions, wherein the calculation of the objective values is a weighted summation including the immediate scores of the updated posterior beliefs; and subsequently, determining the optimal action from among the multiple candidate actions according to the candidate action with the optimal objective value.


In some embodiments, the steps may also include: newly sampling one or more additional measurement samples for one or more of the new propagated beliefs, inferring from the one or more additional measurement samples respective one or more additional posterior beliefs, and calculating additional immediate scores for the one or more additional posterior beliefs. The calculation of the objective values is typically a summation including the immediate scores of the updated posterior beliefs and of the additional posterior beliefs. Selecting re-use measurement samples for the representative set may include determining that the re-use measurement samples are an inadequate representative measurement set and that adding the additional measurement samples provides an adequate representative measurement set.


In further embodiments, selecting the re-use measurement samples for the representative set may include determining which measurement samples are within a pre-determined variance range of the measurement likelihood distribution.


In further embodiments, the steps may also include determining that the information gap is less than a wildfire threshold and updating the stored posterior belief to account for the information gap may include making no updating calculations of the stored posterior belief.


The candidate actions may be sequences of actions over a planning horizon, by which respective sequences of propagated beliefs branch out from the new propagated belief, and the stored propagated beliefs may include planning horizons including stored branches of stored propagated beliefs and associated measurements. The steps may further include, upon determining that the information gap is less than a wildfire threshold, associating a stored branch of the closest stored propagated belief with the new propagated belief, with no updating of stored posterior beliefs of the stored branch.


In further embodiments, the stored propagated beliefs include planning horizons including stored branches of stored propagated beliefs and associated measurements, wherein a stored branch of the closest propagated belief has a planning horizon of L1, wherein a time of the closest propagated belief is k, such that the last posterior beliefs of the stored branch are associated with a time k+L1, wherein the multiple candidate actions have a planning horizon of L2, wherein a time of the new propagated belief is k+l, and wherein the steps of the system further comprise: sampling new measurements between times k+L1 and k+L2+l; responsively inferring additional posterior beliefs; calculating for the additional posterior beliefs respective additional immediate scores, and calculating the objective values by a weighted summation incorporating the additional immediate scores.


In some embodiments, the immediate score may reflect a cost that is a function of the posterior belief and the candidate action, and wherein the optimal objective value is a minimum objective value of the candidate actions. Alternatively, the immediate score may reflect a reward that is a function of the posterior belief and the candidate action, and wherein the optimal objective value is a maximum objective value of the candidate actions.


In further embodiments, the one or more measurement samples associated with each propagated belief space may be one sampled measurement that is selected as the maximum likelihood measurement. Determining the information gap may include performing data association (DA) matching to determine measurements with DA to update, to add or to remove. Updating the stored posterior belief may include performing corresponding DA modifications to generate updated or added measurements and responsively to update one or more corresponding measurement values.


The invention accordingly comprises the several steps and the relation of one or more of such steps with respect to each of the others, and the apparatus embodying features of construction, combinations of elements and arrangement of parts that are adapted to effect such steps, all as exemplified in the following detailed disclosure, and the scope of the invention will be indicated in the claims.





BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing/photograph executed in color. Copies of this patent or patent application with color drawing(s)/photograph(s) will be provided by the Office upon request and payment of the necessary fee.


For a more complete understanding of the invention, reference is made to the following description and accompanying drawings:



FIG. 1 illustrates a planning horizon overlap between two planning sessions, one at a planning time at which a prior belief was calculated and the second being a current planning session, according to embodiments of the present invention;



FIG. 2 illustrates a look-ahead search performed by X-BSP over a planning horizon, according to embodiments of the present invention;



FIGS. 3A and 3B illustrate re-use of a selected branch of beliefs from a precursory planning session, according to embodiments of the present invention;



FIG. 4 illustrates a relative belief distance space, according to embodiments of the present invention;



FIGS. 5A and 5B illustrate adequate and inadequate representation of a belief by measurement samples, according to embodiments of the present invention;



FIGS. 6A-6C illustrate re-use of beliefs over a multi-step planning horizon, for two consecutive planning sessions, according to embodiments of the present invention; and



FIG. 7 illustrates a flowchart of belief space planning through incremental expectation updating, according to embodiments of the present invention.





DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
1.0 Problem Formulation
1.1 Belief Definition

Hereinbelow, aspects of a BSP formulation and the common maximum likelihood (ML) approximation are defined. Let x_t denote the agent's state at time instant t and ℒ represent the environment mapped thus far. The joint state, up to and including time k, is defined as X_k = {x_0, . . . , x_k, ℒ}. We shall be using the notation t|k to refer to some time instant t while considering information up to and including time k. This time notation is used because, hereinbelow, both current and future time indices are used in the same equations. Let z_{t|k} and u_{t|k} denote, respectively, measurements and the control action at time t, while the current time is k. The measurements and controls up to time t, given current time k, are represented by





$$z_{1:t|k} \doteq \{z_{1|k},\ldots,z_{t|k}\}, \qquad u_{0:t-1|k} \doteq \{u_{0|k},\ldots,u_{t-1|k}\}, \tag{2}$$


The posterior probability density function (pdf) over the joint state, denoted as the belief, is given by






$$b[X_{t|k}] \doteq \mathbb{P}\big(X_t \mid z_{1:t|k},\,u_{0:t-1|k}\big) = \mathbb{P}\big(X_t \mid H_{t|k}\big). \tag{3}$$


where Ht|k≐{u0:t−1|k, z1:t|k} represents history at time t given current time k. The propagated belief at time t, i.e. belief b[Xt|k] lacking the measurements of time t, is denoted by






$$b^-[X_{t|k}] \doteq b[X_{t-1|k}]\cdot\mathbb{P}\big(x_t \mid x_{t-1},\,u_{t-1|k}\big) = \mathbb{P}\big(X_t \mid H^-_{t|k}\big), \tag{4}$$


where H^−_{t|k} ≐ H_{t−1|k} ∪ {u_{t−1|k}}. Using Bayes' rule, Eq. (3) can be rewritten as











$$b[X_{t|k}] \propto \mathbb{P}(X_0)\prod_{i=1}^{t}\left[\mathbb{P}\big(x_i \mid x_{i-1},\,u_{i-1|k}\big)\prod_{j\in\mathcal{M}_{i|k}}\mathbb{P}\big(z_{i,j|k} \mid x_i,\,l_j\big)\right], \tag{5}$$







where ℙ(X_0) is the prior on the initial joint state, and ℙ(x_i|x_{i−1}, u_{i−1|k}) and ℙ(z_{i,j|k}|x_i, l_j) denote, respectively, the motion and measurement likelihood models. Here, z_{i,j|k} represents an observation of landmark l_j from robot pose x_i, while the set ℳ_{i|k} contains all landmark indices observed at time i, i.e. it denotes data association (DA). The DA of a few time steps is denoted by ℳ_{1:i|k} ≐ {ℳ_{1|k}, . . . , ℳ_{i|k}}.


1.2 Belief Space Planning

The purpose of BSP is to determine an optimal action given an objective function J, the belief b[X_{k|k}] at planning time instant k and, considering a discrete action space, a set of candidate actions 𝒰_k. While these actions can have different planning horizons, we consider for simplicity the same horizon of L look-ahead steps for all actions, i.e. 𝒰_k = {u_{k:k+L−1}}. The optimal action is given by








$$u^*_{k:k+L-1|k} = \operatorname*{argmax}_{u_{k:k+L-1|k}\,\in\,\mathcal{U}_k} J\big(u_{k:k+L-1|k}\big),$$




where the general objective function J(.) is defined as











$$J(u) \doteq \mathbb{E}_{z_{k+1:k+L|k}}\left[\sum_{i=k+1}^{k+L} r_i\big(b[X_{i|k}],\,u_{i-1|k}\big)\right], \tag{6}$$







with u ≐ u_{k:k+L−1|k}, an immediate score (a reward or cost) r_i, and where the expectation is with respect to future observations z_{k+1:k+L|k}, while





$$z_{k+1:k+L|k} \sim \mathbb{P}\big(z_{k+1:k+L|k} \mid H_{k|k},\,u_{k:k+L-1}\big). \tag{7}$$


The expectation in (6) can be written explicitly










$$J(u) = \int_{z_{k+1|k}} \mathbb{P}\big(z_{k+1|k} \mid H_{k|k},\,u_{k|k}\big)\cdot r_{k+1}(\cdot) \;+\; \cdots \;+\; \int_{z_{k+1:i|k}} \mathbb{P}\big(z_{k+1:i|k} \mid H_{k|k},\,u_{k:i-1|k}\big)\cdot r_i(\cdot) \;+\; \cdots \tag{8}$$







Using the chain rule and the Markov assumption, we can re-formulate the joint measurement likelihood (7) as













$$\mathbb{P}\big(z_{k+1:k+L|k} \mid H_{k|k},\,u_{k:k+L-1}\big) = \prod_{i=k+1}^{k+L} \mathbb{P}\big(z_{i|k} \mid H^-_{i|k}\big) \tag{9}$$







where H^−_{i|k} is a function of a specific sequence of measurement realizations, i.e.






$$H^-_{i|k} = H_{k|k}\cup\big\{z_{k+1:i-1|k},\,u_{k:i-1|k}\big\}. \tag{10}$$


Using (9), we can re-formulate (8) as











$$J(u) = \int_{z_{k+1|k}} \mathbb{P}\big(z_{k+1|k} \mid H^-_{k+1|k}\big)\Bigg[r_{k+1}\big(b[X_{k+1|k}],\,u_{k|k}\big) + \cdots + \int_{z_{i|k}} \mathbb{P}\big(z_{i|k} \mid H^-_{i|k}\big)\Big[r_i\big(b[X_{i|k}],\,u_{i-1|k}\big) + \cdots\Big]\Bigg], \tag{11}$$







where each integral accounts for all possible measurement realizations at the appropriate look-ahead step, with i ∈ (k+1, k+L] and b[X_{i|k}] = ℙ(X_{i|k} | H^−_{i|k}, z_{i|k}).


Evaluating the objective for each candidate action in 𝒰_k involves calculating (11), considering all different measurement realizations. As solving these integrals analytically is typically not feasible, in practice they are approximated by sampling future measurements. Although the measurement likelihood ℙ(z_{i|k}|H^−_{i|k}) is unattainable, one can still sample from it. Specifically, consider the i-th future step and the corresponding H^−_{i|k} for some realization of measurements from the previous steps. In order to sample from ℙ(z_{i|k}|H^−_{i|k}), we should marginalize over the future robot pose x_i and landmarks ℒ:














$$\mathbb{P}\big(z_{i|k} \mid H^-_{i|k}\big) = \int_{x_i,\,\mathcal{L}} \mathbb{P}\big(z_{i|k} \mid x_i,\,\mathcal{L}\big)\cdot\mathbb{P}\big(x_i,\,\mathcal{L} \mid H^-_{i|k}\big)\; dx_i\, d\mathcal{L}, \tag{12}$$







where ℙ(x_i, ℒ|H^−_{i|k}) can be calculated from the belief b^−[X_{i|k}] ≐ ℙ(X_{i|k}|H^−_{i|k}). We approximate the above integral via sampling, as summarized in Alg. 1. One can also choose to approximate further by considering only landmark estimates (i.e. without sampling ℒ).


Each sample χ_i and the determined DA (lines 1-2 of Algorithm 1) define a measurement likelihood ℙ(z_{i|k}|χ_i, ℳ_{i|k}(χ_i)) = ∏_{j∈ℳ_{i|k}(χ_i)} ℙ(z_{i,j|k}|x_i, l_j), from which observations are sampled in line 3. Considering n_x samples, {χ_i^n}_{n=1}^{n_x}, we can approximate Eq. (12) by














$$\mathbb{P}\big(z_{i|k} \mid H^-_{i|k}\big) \approx \eta_i \sum_{n=1}^{n_x} w_i^n \cdot \mathbb{P}\big(z_{i|k} \mid \chi_i^n,\,\mathcal{M}_{i|k}(\chi_i^n)\big), \tag{13}$$







where w_i^n represents the weight of the n-th sample, χ_i^n, and η_i^{−1} ≐ Σ_{n=1}^{n_x} w_i^n. Here, since all samples are generated from their original distribution (corresponding to the proposal distribution in importance sampling), see line 1, the weights are identical.


For each sample χ_i^n ∈ {χ_i^n}_{n=1}^{n_x}, we may consider n_z measurement samples (line 3), providing the set {z_{i|k}^{n,m}}_{m=1}^{n_z}. In other words, Alg. 1 yields n_x·n_z sampled measurements, denoted {z_{i|k}}, for a given realization of z_{k+1:i−1|k}. Thus, considering all such possible realizations, we get (n_x·n_z)^{i−k} sampled measurements for the look-ahead step at time i, i.e. the (i−k)-th look-ahead step.
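As an illustration of the two-stage sampling that Alg. 1 describes, consider the following sketch; it assumes a Gaussian propagated belief and an additive-Gaussian measurement model, and the names (h, R, etc.) are illustrative assumptions rather than the patent's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# A minimal sketch of the two-stage sampling attributed above to Alg. 1.
def sample_measurements(mu, Sigma, h, R, n_x, n_z):
    """Sample n_x states from the propagated belief N(mu, Sigma); for each
    state, sample n_z measurements from the measurement model N(h(x), R)."""
    states = rng.multivariate_normal(mu, Sigma, size=n_x)    # line 1: sample states
    samples = []
    for x in states:                                         # line 2: DA would be determined here
        z = rng.multivariate_normal(h(x), R, size=n_z)       # line 3: sample measurements
        samples.append(z)
    return states, samples                                   # n_x * n_z sampled measurements

# e.g., a 2-D state observed through the identity with noise:
states, zs = sample_measurements(np.zeros(2), np.eye(2),
                                 lambda x: x, 0.1 * np.eye(2), n_x=3, n_z=2)
```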


We can now write an unbiased estimator for (11), considering the (n_x·n_z)^{i−k} sampled measurements. In particular, for the look-ahead step at time i, we get













$$\mathbb{E}_{z_{k+1:i|k}}\Big[r_i\big(b[X_{i|k}],\,u_{i-1|k}\big)\Big] \approx \eta_{k+1}\sum_{\{z_{k+1|k}\}} w_{k+1}^n\Bigg(\cdots\bigg(\eta_i\sum_{\{z_{i|k}\}} w_i^n\cdot r_i\big(b[X_{i|k}],\,u_{i-1|k}\big)\bigg)\cdots\Bigg), \tag{14}$$







where H^−_{i|k} varies with each measurement realization.
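For illustration, the innermost weighted summation of the estimator (14) can be sketched as follows; the (weight, score) pairs stand in for the sampled measurement realizations at one look-ahead step:

```python
# A minimal sketch of one level of the estimator in Eq. (14): the expected
# immediate score at a look-ahead step is approximated by a normalized,
# weighted sum over the sampled measurement realizations.
def expected_score(samples):
    """samples: list of (weight, score) pairs for one look-ahead step."""
    total_w = sum(w for w, _ in samples)            # eta_i^{-1}
    return sum(w * r for w, r in samples) / total_w

assert expected_score([(0.5, 1.0), (0.5, 3.0)]) == 2.0
```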


The exponential complexity described above quickly makes these calculations infeasible; this full problem is referred to herein as X-BSP. A commonly used approximation for the X-BSP problem is the maximum likelihood approximation.


1.3 Belief Space Planning under ML

A very common approximation to Eq. (6) is based on the maximum likelihood (ML) observations assumption. This approximation, referred to as ML-BSP, is often used in BSP, and in particular in the context of active SLAM: instead of accounting for different measurement realizations, only the most likely observation is considered at each look-ahead step, which corresponds to n_x = n_z = 1, where the single sample is the most likely one. Under ML, the expectation in Eq. (6) is omitted, and the new objective formulation is given by












$$J_{ML}(u) \doteq \sum_{i=k+1}^{k+L} r_i\big(b[X_{i|k}],\,u_{i-1|k}\big), \tag{15}$$







thus drastically reducing complexity at the expense of performance. The future belief b[X_{i|k}] is given by ℙ(X_{0:i}|H_{k|k}, u_{k:i−1}, z^{ML}_{k+1:i|k}), where for the Gaussian case z^{ML}_{k+1:i|k} are the measurement-model mean values.
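For illustration, a highly simplified sketch of an ML-style rollout (n_x = n_z = 1) is given below; the helpers propagate, update, h and reward, and the .mean attribute, are illustrative assumptions rather than the patent's implementation:

```python
# A minimal sketch of the ML observation assumption: instead of sampling,
# each look-ahead step forces the most likely measurement, which for a
# Gaussian measurement model is its mean h(x).
def ml_objective(b0, actions, propagate, update, h, reward):
    """Accumulate rewards along the single most-likely-measurement rollout."""
    J, b = 0.0, b0
    for u in actions:
        b_prop = propagate(b, u)     # b^-[X_{i|k}]: motion model only
        z_ml = h(b_prop.mean)        # most likely observation (model mean)
        b = update(b_prop, z_ml)     # posterior with the forced z_ml
        J += reward(b, u)
    return J
```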


2.0 Problem Statement

We assume that the planning session at time instant k has been solved by evaluating the objective (6) via appropriate measurement sampling for each action in 𝒰_k and subsequently choosing the optimal action u*_{k:k+L−1|k}. A subset of this action sequence, u*_{k:k+l−1|k} ∈ u*_{k:k+L−1|k} with l ∈ [1, L), is now executed, new measurements z_{k+1:k+l|k+l} are obtained, and the posterior belief b[X_{k+l|k+l}] is calculated in inference, upon which a new planning session is initiated.


Determining the optimal action sequence at time instant k+l involves evaluating the objective function for each candidate action u′ ≐ u_{k+l:k+l+L−1|k+l} ∈ 𝒰_{k+l}:











$$J(u') \doteq \mathbb{E}\left[\sum_{i=k+l+1}^{k+l+L} r_i\big(b[X_{i|k+l}],\,u_{i-1|k+l}\big)\right], \tag{16}$$







where the expectation is with respect to future observations z_{k+l+1:k+l+L|k+l}. Existing approaches perform these costly evaluations from scratch for each candidate action. Embodiments of the present invention described below exploit the fact that the expectation calculations from X-BSP planning sessions at two times, k and k+l, may be similar, and may therefore be "re-used." Hereinbelow, we demonstrate an approach for evaluating the objective function (16) more efficiently by selectively re-using calculations from preceding planning sessions.


The following environmental conditions are assumed for the approach described herein:


Assumption 1 Calculations from a precursory planning session are accessible from the current planning session.


Assumption 2 The planning horizon of current time k+l overlaps the planning horizon of the precursory planning time k, i.e. l∈[1, L).


Assumption 3 Action sets 𝒰_{k+l} and 𝒰_k overlap, in the sense that actions in 𝒰_k that share the executed portion of the optimal action also partially reside in 𝒰_{k+l}. In other words, ∀u ∈ 𝒰_k with u ≐ {u_{k:k+l−1|k}, u_{k+l:k+L−1|k}} and u_{k:k+l−1|k} ≡ u*_{k:k+l−1|k}, ∃u′ ∈ 𝒰_{k+l} such that u′ ≐ {u′_{k+l:k+L−1}, u′_{k+L:k+l+L−1}} and u′_{k+l:k+L−1} ∩ u_{k+l:k+L−1|k} ≠ ∅.


3.0 Approach

The embodiments of the present invention described herein provide an incremental BSP (iX-BSP) approach, by which an objective function is calculated incrementally by re-using calculations from past experience (e.g., precursory planning sessions), thereby saving computation time while preserving the benefits of the expectation solution provided by X-BSP.


iX-BSP re-uses previous calculations by enforcing specific measurements, as opposed to sampling them from the appropriate measurement likelihood distribution. The measurements being enforced were considered and sampled in some precursory planning session, in which each of the measurements had a corresponding posterior belief. By enforcing some previously considered measurements, we can make use of the previously calculated posterior beliefs, instead of performing inference from scratch. To make use of the data acquired since these re-used beliefs were calculated, we incrementally update them, when needed, to match the information up to the current time.


Hereinbelow, in Section 3.1, the calculations required for two successive planning sessions are compared, indicating the basis for re-use of prior calculations in iX-BSP. In Section 3.2 we provide an overview of the entire iX-BSP paradigm, and then cover each of the building blocks of iX-BSP. These include: selecting the closest branch and deciding whether there is sufficient data for calculation re-use (Section 3.3), validating samples for re-use, incorporating forced samples and belief update (Section 3.4), and calculating expectation incrementally with forced samples (Section 3.5). In addition, to demonstrate how iX-BSP can be utilized to improve approximations of the original X-BSP problem, the particular case of iML-BSP is shown (Section 3.6), demonstrating iX-BSP under a maximum likelihood (ML) assumption.


3.1 Comparing Planning Sessions

Here we examine the similarities between two planning sessions that comply with Assumptions 1-3. Consider two planning sessions, both with a horizon of L steps ahead, the first occurring at time k and the second at time k+l. Under Assumption 2 the two planning horizons overlap, i.e. l < L, and under Assumption 3 both planning sessions share some actions. For this comparison let us consider the action chosen at planning time k, which also partially resides in a candidate action from planning time k+l, and denote both as u_{k:k+L} = {u_k, . . . , u_{k+L}}.



FIG. 1 illustrates a graph 100 showing a horizon overlap between planning times: one at a look-ahead step of time t given planning time k, i.e. b[X_{t|k}], and the other at planning time k+l, i.e. b[X_{t|k+l}]. The shared sections, separated by time instances, are denoted as (i), (ii), and (iii).


At future time t ∈ [k+l+1, k+L], the belief created by the action sequence u_{k:t−1} is given by










$$b[X_{t|k}] \propto \underbrace{b[X_{k|k}]}_{(a)}\cdot\underbrace{\prod_{s=k+1}^{k+l}\Big[\mathbb{P}\big(x_s \mid x_{s-1},\,u_{s-1}\big)\prod_{g\in\mathcal{M}_{s|k}}\mathbb{P}\big(z^g_{s|k} \mid x_s,\,l_g\big)\Big]}_{(b)}\cdot\underbrace{\prod_{i=k+l+1}^{t}\Big[\mathbb{P}\big(x_i \mid x_{i-1},\,u_{i-1}\big)\prod_{j\in\mathcal{M}_{i|k}}\mathbb{P}\big(z^j_{i|k} \mid x_i,\,l_j\big)\Big]}_{(c)}, \tag{17}$$







where (17)(a) is the inference posterior at time k corresponding to the lower-bar area (i) in FIG. 1, (17)(b) are the motion and observation factors of future times k+1:k+l corresponding to the lower-bar area (ii) in FIG. 1, and (17)(c) are the motion and observation factors of future times k+l+1:t corresponding to the lower-bar area (iii) in FIG. 1. For the same future time t and the same candidate action, the belief for planning time k+l is given by










$$b[X_{t|k+l}] \propto \underbrace{b[X_{k|k+l}]}_{(a)}\cdot\underbrace{\prod_{s=k+1}^{k+l}\Big[\mathbb{P}\big(x_s \mid x_{s-1},\,u_{s-1}\big)\prod_{g\in\mathcal{M}_{s|k+l}}\mathbb{P}\big(z^g_{s|k+l} \mid x_s,\,l_g\big)\Big]}_{(b)}\cdot\underbrace{\prod_{i=k+l+1}^{t}\Big[\mathbb{P}\big(x_i \mid x_{i-1},\,u_{i-1}\big)\prod_{j\in\mathcal{M}_{i|k+l}}\mathbb{P}\big(z^j_{i|k+l} \mid x_i,\,l_j\big)\Big]}_{(c)}, \tag{18}$$







where (18)(a) is the inference posterior at time k corresponding to the upper-bar area (i) in FIG. 1, (18)(b) are the motion and observation factors of past times k+1:k+l corresponding to the upper-bar area (ii) in FIG. 1 and (18)(c) are the motion and observation factors of future times k+l+1:t corresponding to the upper-bar area (iii) in FIG. 1.


Although seemingly conditioned on different histories, (17)(a) and (18)(a) are identical and denote the same posterior obtained at time k (see FIG. 1 area (i)), leaving the difference between (17) and (18) restricted to (·)(b) and (·)(c). While (17)(b) represents future actions and future measurements predicted at time k, (18)(b) represents executed actions and previously acquired measurements; this can be seen more clearly using area (ii) in FIG. 1. At planning time k (i.e. the lower bar), area (ii) denotes the future prediction for the time interval k:k+l, while at planning time k+l (i.e. the upper bar), the same time interval denotes past measurements; so (17)(b) and (18)(b) are potentially different, depending on how accurate the prediction at planning time k was. The differences between the predictions made during planning and the actual measurements obtained at present time are twofold: the difference in measurement values and the difference in data association, as described further hereinbelow.


Even though both (17)(c) and (18)(c) refer to future actions and measurements (see area (iii) in FIG. 1), they do so with possibly different values and data association, since they were sampled from possibly different probability densities. Solving the objective (11) requires sampling from (12) (e.g. using Alg. 1); the samples from planning time k were sampled from ℙ(z_{t|k}|H^−_{t|k}), while the samples from planning time k+l were sampled from ℙ(z_{t|k+l}|H^−_{t|k+l}). These probabilities would be identical only if conditioned on the same history, i.e. only if the predictions made for time interval k:k+l at planning time k were accurate both in data association and in measurement values.


As such, in order to bridge the gap between (17) and (18) and obtain identical expressions, one must first update (17)(b) to match (18)(b), and second, adjust the samples from (17)(c) to properly represent the updated measurement probability density.


3.2 Approach Overview

This section presents an overview of iX-BSP at planning time k+l, while the relevant precursory planning session occurred at planning time k, as summarized in Appendix C, Algorithm 2: “iX-BSP at Planning Time k+l”.


After executing l steps out of the optimal (or suboptimal) action sequence suggested by planning at time k, and performing inference over the newly received measurements, we obtain b[X_{k+l|k+l}]. Performing planning at time k+l under iX-BSP requires first locating the "closest belief" to b[X_{k+l|k+l}] from the precursory planning session (Alg. 2 line 1). The "closest belief" to b[X_{k+l|k+l}], denoted b̃[X_{k+l|k}], is the one with the minimal distance to it, under an appropriate probability density function metric.


Hereinbelow, the following definitions of terms are used:













Variable                        Description

(·)_{t|k}                       of time t, while current time is k
b[X_{t|k}]                      belief at time t, while current time is k
b⁻[X_{t|k}]                     belief at time t−1 propagated only with action u_{t−1|k}
b̃[X_{t|k}]                      the root of the branch selected for re-use from planning time k
𝔹_{t|k}                         the set of all beliefs from planning time k rooted in b̃[X_{t|k}]
b_s[X_{t|k+l}]                  the s-th sampled belief representing b[X_{t|k+l}]
b_α^{s−}[X_{t+1|k+l}]           the sampled belief b_s[X_{t|k+l}] propagated with the α-th candidate action
{b_α^r[X_{t|k+l}]}_{r=1}^n      a set of n sampled beliefs that are first-order children of b_α^{s−}[X_{t|k+l}] and represent b[X_{t|k+l}]
b_α^{s′−}[X_{t+i|k}]            the propagated belief from 𝔹_{t|k} closest to b_α^{s−}[X_{t+i|k+l}]
dist                            the distance between b_α^{s′−}[X_{t|k}] and b_α^{s−}[X_{t|k+l}]
Dist                            the distance between b̃[X_{t|k}] and the corresponding posterior b[X_{t|t}]
{b^r[X_{t|k}]}_{r=1}^n          a set of n sampled beliefs representing b[X_{t|k}]
n_u                             number of candidate actions per step
(n_x · n_z)                     number of samples for each candidate action
data                            all available calculations from the precursory planning session
u*_{k:k+L|k}                    the (sub)optimal action sequence of length L chosen in planning at time k
ϵ_c                             belief distance critical threshold, above which re-use has no computational advantage
ϵ_wf                            wildfire threshold, below which beliefs are considered close enough for re-use without update
useWF                           a binary flag determining whether or not the wildfire condition is considered
D_PQ(p, q)                      the distance between distributions p and q, according to the D_PQ metric
D_DA(p, q)                      the divergence between distributions p and q, according to the data association difference
ℳ_{t|k}                         data association (DA) at time t, while current time is k









If the distance of the closest belief (denoted Dist) is larger than some critical value ϵ_c, i.e. the closest prediction from the precursory planning session is too far off, iX-BSP would have no advantage over standard X-BSP, so the latter is executed (Alg. 2 line 12).


On the other hand, if Dist meets the wildfire condition, i.e. it is smaller than the wildfire threshold ϵ_wf, we consider the difference between the beliefs insignificant and continue with re-using the precursory planning session without any additional update (Alg. 2 line 4).


When the precursory planning is "close enough" (i.e., its distance is less than a predefined threshold, as described below), we can re-use it to save computation time. Under iX-BSP, the closest belief from planning at time k (denoted b̃[X_{k+l|k}]) to the posterior received at the current time, b[X_{k+l|k+l}], is compared to that posterior. If the difference between the closest belief and the posterior is unacceptable, X-BSP is performed. The full process of selecting the closest belief and updating the posterior is described in Alg. 2 (Appendix C).


Given b̃[X_{k+l|k}], we can extract all future beliefs, i.e. those for times k+l+1:k+L, from planning time k that originate in b̃[X_{k+l|k}]. These beliefs are updated with the information received in inference between time instances k+1 and k+l, and predicted measurements (line 6) are selectively re-sampled to maintain a representative set of samples for the nominal distribution. In case one of the aforementioned beliefs also meets a "wildfire" condition, that is, it is sufficiently close to the posterior being estimated, it and all of its descendants are considered already updated. Once the update is complete, there remains a planning horizon of just L−l steps, i.e. to the extent of the horizon overlap; hence we need to calculate the remainder of the planning horizon from scratch, i.e. perform X-BSP for the final l steps (line 8). We are now in a position to update the immediate score functions (i.e., the reward or cost functions) for the updated beliefs and calculate the expected values (i.e., the "objective values") of the candidate actions in search of the optimal (or suboptimal) action sequence (line 10), thus completing the planning for time k+l. Hereinbelow the immediate scores calculated for the posterior beliefs are assumed to be either rewards or costs.


Since at planning time k+l we are re-using samples from different planning sessions, we are required to compensate for the different measurement likelihoods through proper formulation. In the sequel we show that our problem falls within the multiple importance sampling problem (see Appendix A), so we estimate the expected immediate scores (reward or cost values) using an importance-sampling-based estimator, thus completing the planning for time k+l. As opposed to X-BSP, which returns only the selected action sequence, iX-BSP is also required to return more data from the planning process in order to facilitate re-use (line 14).
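To summarize the flow described above, a highly simplified, hedged sketch of the top-level logic (cf. Alg. 2) follows; every helper name is an illustrative assumption, injected as a callable so the sketch stays self-contained:

```python
# A highly simplified sketch of the top-level flow described above; the
# injected callables stand in for the steps of Alg. 2 and are illustrative
# assumptions, not the patent's implementation.
def ix_bsp(posterior, data, eps_c, eps_wf, use_wf,
           select_closest_branch, x_bsp, reuse_existing_beliefs,
           x_bsp_remaining, evaluate_objectives):
    root, dist = select_closest_branch(posterior, data)
    if dist > eps_c:                        # too far off: no advantage in re-use
        return x_bsp(posterior)
    skip_update = use_wf and dist < eps_wf  # wildfire: re-use without update
    tree = reuse_existing_beliefs(posterior, root, data, skip_update)
    tree = x_bsp_remaining(tree)            # final l steps from scratch
    return evaluate_objectives(tree)        # MIS-weighted objective values
```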


3.3 Selecting Closest Branch


FIG. 2 shows a graph 200, demonstrating a look-ahead search performed by X-BSP on a tree with depth L. Each belief tree node represents a belief. For each node, the tree branches either for a candidate action or a sampled measurement. The corresponding belief tree for ML-BSP is marked with solid lines, while the dashed lines represent the parts of X-BSP that relate to sampled measurements. Under iX-BSP, the gray-marked parts of the tree are being re-used for the succeeding planning session.


Since in X-BSP the number of beliefs per future time step is derived from the number of samples per action per time step, in order to re-use previous calculations we need to choose which beliefs to re-use. Without loss of generality, in this section we use FIG. 2 to illustrate the branch selection process made in SelectClosestBranch. FIG. 2 illustrates a planning session of X-BSP at planning time t=1 for a horizon of L steps, with n candidate actions and j sampled measurements per step, resulting in (n·j)^1 different beliefs for future time t=2 and (n·j)^{L−1} for future time L. Let us assume action u_1^n has been determined as optimal at planning time t=1 and has been executed; after attaining new measurements for current time t=2 and calculating the posterior belief b[X_{2|2}], we perform planning once more. This time, instead of calculating everything from scratch, we would like to re-use the beliefs calculated at planning time t=1. But which beliefs should be considered for re-use?


Our goal in selecting beliefs for re-use is to minimize any required update, i.e. the beliefs we would like to re-use should be as “close” as possible to the beliefs we would have obtained through standard X-BSP. We can determine the distance between two beliefs through some appropriate metric, e.g. the DPQ metric described in Appendix B. Choosing a close belief ensures a minimal required update both for this belief and its descendants, while choosing an almost identical belief would require no update at all, as described hereinbelow with respect to locating the closest belief.


Going back to our example in FIG. 2, we consider all beliefs b[X_{2|1}] meeting Assumption 3, i.e. all beliefs marked by the dark gray area. We consider the closest belief as the one with the minimal distance to the posterior b[X_{2|2}] and denote it as b̃[X_{2|1}]. Once we have determined b̃[X_{2|1}], we define the closest branch as consisting of all the beliefs rooted in b̃[X_{2|1}], and denote it as 𝔹_{2|1}, marked in FIG. 2 by the light gray area. The SelectClosestBranch method requires the posterior and the data from the precursory planning session as inputs, and returns the root of the selected branch, b̃[X_{k+l|k}], along with its distance to the posterior (denoted as Dist).


3.4 Re-Use of Existing Beliefs

Once we have determined b̃[X_{k+l|k}] (Alg. 2, line 1), we consider all of its descendants as candidate beliefs for re-use from the precursory planning session and denote this set as 𝔹_{k+l|k}. For example, in FIG. 2, the set 𝔹_{2|1}, denoted by a light gray area 210, consists of all beliefs rooted in b̃[X_{2|1}]. Once 𝔹_{k+l|k} has been determined, we can utilize it following Alg. 3, to obtain all immediate scores (reward or cost values) up to time k+L while adjusting these appropriately, as discussed next. Without loss of generality, in this section we use FIG. 3 to illustrate the belief update process, where FIG. 3A represents the branch 𝔹_{2|1} selected for re-use from FIG. 2, and FIG. 3B illustrates the building of the planning tree at time t=2 through selective re-use of beliefs from 𝔹_{2|1}.



FIG. 3 presents the belief update process of iX-BSP as a belief tree, where each node represents a belief that branches either on one of n candidate actions or on j sampled measurements. The branch selected for re-use from FIG. 2, denoted by 𝔹_{2|1}, is presented in FIG. 3A and as a watermark in FIG. 3B. The succeeding iX-BSP planning session at time t=2 is illustrated in FIG. 3B, where the re-used sampled measurements and succeeding beliefs are marked in light blue.


The purpose of Alg. 3 is to create the planning tree of planning time k+l through selective re-use of 𝔹_{k+l|k}. The process starts with the posterior b[X_{k+l|k+l}] and continues with every new belief b_s[X_{i|k+l}] that is added to the new planning tree, up to future time k+L, where s indexes the different sampled beliefs at future time i.


First, we check whether the belief b_s[X_{i|k+l}] was created under the wildfire condition (line 3), i.e. taken directly from 𝔹_{k+l|k} without any update. If it was, we continue to take its descendants directly from the appropriate beliefs in 𝔹_{k+l|k} without any update (line 5) and mark them as created under the wildfire condition (line 6). For the case where the belief b_s[X_{i|k+l}] was not created under the wildfire condition, we propagate it with the α-th candidate action, where α ∈ [1, n_u] and n_u denotes the number of candidate actions per time step (line 9), obtaining b_α^{s−}[X_{i+1|k+l}]. We then consider all propagated beliefs b⁻[X_{i+1|k}] ⊂ 𝔹_{k+l|k} and search for the one closest to b_α^{s−}[X_{i+1|k+l}], in the sense of belief distance. Once found, we denote the closest propagated belief as b_α^{s′−}[X_{i+1|k}] (line 10).


In case there is no belief close enough to make the update worthwhile, i.e. dist > ϵ_c, for this candidate action we continue as if using X-BSP (line 21). In case the distance of the closest belief meets the wildfire condition ϵ_wf, we re-use all beliefs b[X_{i+1|k}] ⊂ 𝔹_{k+l|k} that are rooted in b^{s′−}[X_{i+1|k}]. Otherwise, we continue and check whether the samples generated using b_α^{s′−}[X_{i+1|k}] constitute an adequate representation of ℙ(z_{i+1|k+l}|H_{i|k+l}, u^α_{i|k+l}), and re-sample if needed (line 18, see Section 3.4.2). Once we obtain the updated set of samples {repSamples}_1^{n_x·n_z}, whether all were freshly sampled, all entirely re-used, or somewhere in between, we can acquire the set of posterior beliefs for look-ahead step i+1, {b_α[X_{i+1|k+l}]}_1^{n_x·n_z} (line 23), through an update, as discussed in the sequel (see Section 3.4.3). Once all beliefs for future time i+1 have been updated, we can update the immediate scores (reward or cost values) of each posterior belief (see Section 3.4.4).


We repeat the entire process for the newly acquired beliefs {bα[Xi+1|k+l]}1nx·nz, and so forth, up to k+L.


We will now demonstrate Alg. 3 (see Appendix C) using FIG. 3. FIG. 3A illustrates the branch selected for re-use from the precursory planning session at time t=1, i.e. 𝔹_{2|1} (see FIG. 2). There are n candidate actions at each step, and for each candidate action there are j sampled measurements. FIG. 3B illustrates a part 300 of the planning tree at time t=2, where the top of the tree is the posterior belief of current time t=2, i.e. b[X_{2|2}], from which we start the algorithm. Since b[X_{2|2}] is the posterior of current time t=2, it was not created under the wildfire condition, so we jump directly to line 8. We propagate b[X_{2|2}] with each of the n candidate actions, starting with u_2^1, and obtain the left-most belief in the second level of the tree, b⁻[X_{3|2}] (line 9). Using BeliefDist(.) we obtain the closest belief from 𝔹_{2|1} to b⁻[X_{3|2}], as well as their distance dist. For our example the closest belief turns out to be the one that was also propagated by the same candidate action: the left-most b⁻[X_{3|1}] in the second level of 𝔹_{2|1}. Since the distance suggests re-use is worthwhile but does not meet the wildfire condition, i.e. ϵ_wf < dist ≤ ϵ_c, we proceed to line 17.


We denote the set (of sets) of all j sampled measurements as samples, i.e. samples ← {{z_{3|1}}^1, . . . , {z_{3|1}}^j}. Using IsRepSample(.) we obtain a representative set for the measurement likelihood ℙ(z_{3|2}|H_{2|2}, u_2^1) (see Section 3.4.2). As we can see in FIG. 3B, other than {z_{3|1}}^1, which was re-sampled, all other samples are re-used (denoted by blue arrows). Once we have a representative set of measurements, in our case all but one re-used from planning time t=1, we can update the appropriate beliefs using UpdateBelief(.) (see Section 3.4.3). The belief resulting from the newly sampled measurement {z_{3|2}}^1 is calculated from scratch by adding the measurement to b⁻[X_{3|2}] and performing inference, while the re-used samples allow us to incrementally update the appropriate beliefs from 𝔹_{2|1}, rather than calculate them from scratch (see Section 3.4.3).


We now have an updated set of beliefs for future time t=3 that consider the candidate action u_2^1. For each of the aforementioned beliefs we incrementally calculate the appropriate immediate score (reward or cost) values (see Section 3.4.4), thus completing the incremental update for candidate action u_2^1. We repeat the aforementioned steps for the rest of the candidate actions, thus completing the third level of the planning tree presented in FIG. 3B. In a similar manner we continue to incrementally calculate the deeper levels of the planning tree, up to future time t=L, thus concluding Alg. 3 (Appendix C).


As the planning horizon of planning at time k concludes at look-ahead step k+L, with the completion of Alg. 3 we conclude the re-use of beliefs from the precursory planning session, thus obtaining the beliefs for look-ahead steps k+l+1:k+L required for immediate score (reward or cost) calculation. Beliefs for the rest of the horizon, k+L+1:k+l+L, are obtained by performing X-BSP (Alg. 2 (Appendix C), line 8). In the sequel we elaborate on belief distance (Section 3.4.1), determining whether samples are representative (Section 3.4.2), the process of belief update given the representative set of samples {repSamples}_1^n (Section 3.4.3), and the incremental calculation of the immediate scores (reward or cost values) per sampled belief (Section 3.4.4).


3.4.1 Locating Closest Belief

In this section we address the problem of locating the closest belief, as required when selecting the closest branch for re-use (Alg. 2 line 1) or when re-using existing beliefs (Alg. 3 line 10) as part of the iX-BSP paradigm (Alg. 2 line 6). FIG. 4 shows an illustration of the relative belief distance space. Each point in this space represents some belief b[X_{t|k}], where the black dot denotes b[X_{t|k+l}] as the origin. All beliefs b[X_{t|k}] closer to the origin than ϵ_wf may be re-used without any update calculations. All beliefs b[X_{t|k}] closer to the origin than ϵ_c but farther than ϵ_wf may be re-used with some update calculations. All beliefs b[X_{t|k}] that are more than ϵ_c away from the origin are considered not "close enough" to make re-use worthwhile.


As part of our problem, we have a set of candidate beliefs for re-use, denoted 𝔹_{t|k}, and either a posterior b[X_{t|k+l}] or a propagated posterior b⁻[X_{t|k+l}] we wish to be close to. The problem is to find the belief in the set 𝔹_{t|k} closest to b[X_{t|k+l}]. Although iX-BSP could also be performed under a non-metric distance (as discussed later in this section), the use of a belief metric allows us to have fixed criteria over the distance values, as well as to utilize the advantages of metric space for computational efficiency.


A belief metric quantifies the difference between two beliefs and puts it in terms of a scalar distance. Hereinbelow, we describe one alternative, the D_PQ metric, as presented in Endres and Schindelin, "A new metric for probability distributions," IEEE Transactions on Information Theory, 2003, incorporated herein by reference (Endres). This is a metric for general probabilities that also has a special form in the case of Gaussian beliefs (for a full derivation see Appendix B). For two Gaussian beliefs b_p[X_{t|k+l}] and b_q[X_{t|k}], with μ_p and μ_q as their first moments and Σ_p and Σ_q as their second moments respectively, the D_PQ distance between them is given by











$$D_{PQ}\big(b_p[X_{t|k+l}],\,b_q[X_{t|k}]\big) = \frac{1}{4}\Big[(\mu_p-\mu_q)^T\big(\Sigma_q^{-1}+\Sigma_p^{-1}\big)(\mu_p-\mu_q) + \mathrm{tr}\big(\Sigma_q^{-1}\Sigma_p\big) + \mathrm{tr}\big(\Sigma_p^{-1}\Sigma_q\big) - d_p - d_q\Big] \tag{19}$$







where d_p and d_q are the joint state dimensions of b_p[X_{t|k+l}] and b_q[X_{t|k}], respectively.
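For illustration, Eq. (19) can be evaluated with a few lines of numpy; the sketch below follows the equation as reconstructed above and assumes the two beliefs share the same state dimension:

```python
import numpy as np

# A numpy sketch of the Gaussian-case D_PQ distance in Eq. (19); inputs are
# the means and covariances of the two Gaussian beliefs. Illustrative only.
def dpq_gaussian(mu_p, Sigma_p, mu_q, Sigma_q):
    dp, dq = mu_p.shape[0], mu_q.shape[0]
    Sp_inv = np.linalg.inv(Sigma_p)
    Sq_inv = np.linalg.inv(Sigma_q)
    diff = mu_p - mu_q
    val = (diff @ (Sq_inv + Sp_inv) @ diff
           + np.trace(Sq_inv @ Sigma_p)
           + np.trace(Sp_inv @ Sigma_q)
           - dp - dq)
    return 0.25 * val
```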


To locate the closest belief to b[X_{t|k+l}], we may consider a belief metric space, in which each point is a unique projection of a candidate belief, denoting the distance between that candidate belief and b[X_{t|k+l}]. FIG. 4 illustrates such a space 400, where the black dot 410 at the center of the space represents D_PQ(b[X_{t|k+l}], b[X_{t|k+l}]), and the rest of the white points 420, 430, and 440 denote D_PQ(b[X_{t|k}], b[X_{t|k+l}]). We divide distances around the null projection into three areas, separated by the threshold parameters ϵ_wf and ϵ_c. (When wildfire is not enabled, ϵ_wf = 0, i.e. the distances are divided into two areas, separated by a single parameter ϵ_c.)


For a belief from a prior planning session, b[X_{t|k}], although referring to the same future time t as b[X_{t|k+l}], the belief is conditioned on a different history, and is therefore potentially different. While Section 3.1 discussed the reasons for such a difference between b[X_{t|k}] and b[X_{t|k+l}], in this section we quantify this difference using a belief metric.


Projecting b[X_{t|k}] into our belief distance space yields a point that suggests how different b[X_{t|k}] is from b[X_{t|k+l}]; e.g. the white dot 440 suggests a larger difference than the white dot 420. After projecting all candidate beliefs from 𝔹_{t|k} into the belief metric space in reference to b[X_{t|k+l}], the problem of locating the closest belief to b[X_{t|k+l}] is reduced to a problem of locating the nearest neighbor.


In order to simplify the selection process and avoid dealing with the belief metric space, a straightforward method may be applied to get a qualitative distance between 𝔹_{t|k} and b[X_{t|k+l}]. We denote this method as D_DA. This is a qualitative distance and not a metric. This simplification may be relaxed to determine a distance between beliefs with the belief metric D_PQ presented in Appendix B. Under D_DA, all candidate beliefs are first sorted according to DA differences, looking for the smallest available difference. In case there is more than a single belief with a minimal difference, we continue to sort the remaining beliefs according to the difference between values of corresponding predicted measurements, and similarly look for the minimal difference. In case there is more than a single belief with minimal measurement value difference, we select arbitrarily from the remaining beliefs and consider the chosen belief as the closest one. A detailed explanation of the DA matching process is provided in PCT application WO2019/171378.


Any method known in the art may be applied to determine the closest belief (e.g. storing the distances in a self-balancing binary search tree). Alg. 4 presents a trivial yet effective way to determine the closest belief given a set of beliefs 𝔹_{k+l|k} and a target belief b⁻[X_{i+1|k+l}]: simply calculate the distance between the target belief and each belief in the set 𝔹_{k+l|k}, and pick the closest one. The distance calculation may be done using a belief metric, such as the D_PQ distance (see Appendix B), although any proper belief metric would suffice.
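A minimal sketch of this linear-scan selection, with belief_dist standing in for a belief metric such as D_PQ, might look as follows:

```python
# A minimal sketch of the linear-scan nearest-neighbor selection described
# above (cf. Alg. 4): keep the candidate with the smallest distance to the
# target belief. `belief_dist` is any belief metric; names are illustrative.
def select_closest_belief(candidates, target, belief_dist):
    closest, d_min = None, float("inf")
    for b in candidates:
        d = belief_dist(b, target)
        if d < d_min:
            closest, d_min = b, d
    return closest, d_min
```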


The distinction as to which area δ_min belongs to is determined outside of Alg. 4 (see Alg. 2 lines 2-3 and Alg. 3 lines 11-12).


3.4.2 Representative Sample

This section covers the problem of obtaining a set of measurement samples that are representative of the measurement likelihood distribution from which we should sample. The motivation for re-using previously sampled measurements lies in the desire to refrain from performing inference. As explained in Section 3.1, assuming the differences between (17)(b) and (18)(b) (the predicted factors and their already obtained counterparts, respectively) have been resolved, the difference between Eq. (17) and Eq. (18) is limited to the difference between (17)(c) and (18)(c). Assuming both use the same action sequence, the difference between (17)(c) and (18)(c) is limited to the predicted measurements being considered by each.


Under the sampling paradigm presented in Alg. 1, it is sufficient to determine the representativeness of a measurement sample based on the state sample χ, which should be sampled from the propagated belief b⁻[X_{i|k+l}]. The sampling method shown in Alg. 1 describes sampling from the unknown measurement likelihood ℙ(z_{i|k}|H^−_{i|k}). We marginalize over all possible states and are left with Eq. (12), consisting of two terms to which we have access: the measurement model ℙ(z_{i|k}|x_i, ℒ) and the marginal distribution over the state at time i, ℙ(x_i, ℒ|H^−_{i|k}), obtained from b⁻[X_{i|k}]. The sampling process comprises two stages. The first stage is sampling n_x states for time i using the propagated belief b⁻[X_{i|k}], each such state defining a measurement model distribution, from which we sample n_z measurements. At the end of this process we are left with (n_x·n_z) sampled measurements.


Considering the known (stochastic) measurement model, the space of measurement model distributions is uniquely defined by the set of state samples. So, in order to simplify the selection of representative measurement samples, we consider only the set of sampled states and assume that a set of sampled states that is representative of the propagated belief it should have been sampled from yields a set of representative sampled measurements. Following the aforementioned, the problem of determining a set of representative measurement samples becomes a problem of determining a set of state samples representative of some propagated belief.


Let us consider samples and b[X_{i|k+l}], denoting respectively the candidate measurements for re-use and the propagated belief from planning time k+l. Because samples were sampled from distributions different from ℙ(z_{i+1|k+l}|H_{i+1|k+l}), we need to ensure they constitute an adequate representation of it. In Alg. 5 we consider the sampled states that led to the acquired sampled measurements (Alg. 5 line ??), and denote them as stateSamples. We consider each sampled state separately, and determine sample ∈ stateSamples as representative if it falls within a predetermined σ range (e.g., a pre-determined variance range) of the distribution it should have been sampled from.



FIGS. 5A and 5B illustrate adequate and inadequate representation of a belief by samples. FIG. 5A illustrates a belief 500 (denoted by a black ellipse) over the propagated joint state at future time k+3, calculated as part of planning at time k, and twenty-two samples (denoted by green "+" signs) taken from it. FIG. 5B illustrates two instantiations of beliefs over the propagated joint state at future time k+3, calculated as part of planning at time k+1 (denoted in red and blue), overlapping the belief 500 of the precursory planning time and its samples. While the samples of belief 500 can be considered an adequate representation of the blue belief, they constitute an inadequate representation of the red belief, given the dearth of samples contained within the predefined variance range. FIG. 5A illustrates the set of states χ (denoted by green "+" signs) used for future time k+3 under planning time k, while FIG. 5B illustrates how well the same samples represent two instantiations of the same future time k+3 in the succeeding planning session at time k+1. By considering some ±β_σ·σ range of each instantiation of b[X_{k+3|k+1}] we can determine which of the available samples can be considered representative. Following Alg. 5, for a value of β_σ=1, under the blue belief instantiation in FIG. 5B all but the leftmost sample will be considered representative of b[X_{k+3|k+1}], since they are within the covariance ellipsoid representing the ±1σ range; under the red belief instantiation only the three samples within the red covariance ellipsoid will be considered representative, while the rest will be re-sampled from the nominal distribution. In order to facilitate the use of importance sampling in solving the expected immediate score (i.e., reward or cost value), one should have access to the importance sampling distributions, denoted by {q(.)}_1^n in Alg. 5.
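The ±β_σ·σ acceptance test of Alg. 5 can be sketched for the common case of a Gaussian propagated belief, where the test reduces to a Mahalanobis-distance check. This is a minimal sketch under that Gaussian assumption; the names and the resampling hook are illustrative.

```python
import numpy as np

def is_representative(sample, mu, cov, beta_sigma=1.5):
    """Accept a state sample that falls inside the +/- beta_sigma
    covariance ellipsoid of the belief it should represent."""
    diff = sample - mu
    m2 = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    return m2 <= beta_sigma ** 2

def filter_samples(samples, mu, cov, resample, beta_sigma=1.5):
    """Alg. 5 flavor: keep accepted samples, re-sample rejected ones."""
    return [s if is_representative(s, mu, cov, beta_sigma) else resample()
            for s in samples]
```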


3.4.3 Belief Update as Part of Immediate Score Calculations

Once we determine a set of n samples we wish to use (Alg. 3 line 18), whether newly sampled, re-used, or a mixture of both, we can then update the appropriate beliefs in order to form the set {b^r[X_{i+1|k+l}]}_{r=1}^{n} required for calculating the immediate scores (i.e., reward or cost function values) at the look ahead step i+1. This section describes the belief update process, which depends on whether a sample was newly sampled or re-used. We start with the standard belief update for newly sampled measurements; continue with determining the difference between some generic belief b[X_{i+1|k+l}] and its counterpart from planning time k, i.e. b[X_{i+1|k}]; and conclude with the belief update for a re-used measurement.


For a newly sampled measurement z_{i+1|k+l}, we follow the standard belief update of incorporating the measurement factors into the propagated belief b^{-}[X_{i+1|k+l}], as in











$$b[X_{i+1|k+l}] \propto b^{-}[X_{i+1|k+l}] \cdot \prod_{j \in \mathcal{M}_{i+1|k+l}} \mathbb{P}\big(z^{j}_{i+1|k+l} \mid x_{i+1}, l_{j}\big), \tag{20}$$







and then performing inference; hence no re-use of calculations from planning time k.


As mentioned earlier, the motivation for re-using samples is to avoid the costly computation time of performing inference over a belief. Since we already performed inference over beliefs at planning time k, if we re-use the same samples we can avoid performing the standard belief update (20), and instead utilize the beliefs from planning time k.


In summary, the necessary adjustments required for a belief from planning time k that is to be re-used are as follows. The factors of two beliefs over the same future time but different planning sessions can be divided into three groups, as illustrated in FIG. 1: (i) factors representing shared history, which are by definition identical between the two; (ii) factors that are potentially different, since they are predicted for time k but given for time k+l; and (iii) factors representing future time for both, but each conditioned on a different history subject to (ii), and hence also potentially different.


Let us consider the measurements zi+1|k⊂{repSamples}1n tagged for re-use. The belief we are required to adjust is the one resulting from zi+1|k at planning time k, i.e.





$$b[X_{i+1|k}] \propto \mathbb{P}\big(X_{0:i+1} \mid H_{i|k}, z_{i+1|k}\big). \tag{21}$$


Although b[X_{i+1|k}] is given to us from precursory planning, it might require an update to match the new information received up to time k+l. Diverging from (20), the present approach updates b[X_{i+1|k}] incrementally, without performing a new inference. The process of incrementally updating a belief can be divided into two general steps: first updating the DA, and then the measurement values. Let us consider the belief we wish to re-use from planning time k











$$b[X_{i+1|k}] \propto \underbrace{b[X_{k|k}]}_{(a)} \cdot \underbrace{\prod_{s=k+1}^{k+l}\Big[\mathbb{P}(x_s \mid x_{s-1}, u_{s-1}) \prod_{g \in \mathcal{M}_{s|k}} \mathbb{P}\big(z^{g}_{s|k} \mid x_s, l_g\big)\Big]}_{(b)} \cdot \underbrace{\prod_{r=k+l+1}^{i}\Big[\mathbb{P}(x_r \mid x_{r-1}, u_{r-1}) \prod_{j \in \mathcal{M}_{r|k}} \mathbb{P}\big(z^{j}_{r|k} \mid x_r, l_j\big)\Big] \cdot \mathbb{P}(x_{i+1} \mid x_i, u_i)}_{(c)} \cdot \underbrace{\prod_{n \in \mathcal{M}_{i+1|k}} \mathbb{P}\big(z^{n}_{i+1|k} \mid x_{i+1}, l_n\big)}_{(d)}, \tag{22}$$







and the propagated belief we obtained amidst building the planning tree of time k+l











$$b^{-}[X_{i+1|k+l}] \propto \underbrace{b[X_{k|k+l}]}_{(a)} \cdot \underbrace{\prod_{s=k+1}^{k+l}\Big[\mathbb{P}(x_s \mid x_{s-1}, u_{s-1}) \prod_{g \in \mathcal{M}_{s|k+l}} \mathbb{P}\big(z^{g}_{s|k+l} \mid x_s, l_g\big)\Big]}_{(b)} \cdot \underbrace{\prod_{r=k+l+1}^{i}\Big[\mathbb{P}(x_r \mid x_{r-1}, u_{r-1}) \prod_{j \in \mathcal{M}_{r|k+l}} \mathbb{P}\big(z^{j}_{r|k+l} \mid x_r, l_j\big)\Big] \cdot \mathbb{P}(x_{i+1} \mid x_i, u_i)}_{(c)}. \tag{23}$$







The purpose of this incremental update of b[Xi+1|k] is to obtain b[Xi+1|k+l] such that











$$b[X_{i+1|k+l}] \propto b^{-}[X_{i+1|k+l}] \cdot \prod_{n \in \mathcal{M}_{i+1|k}} \mathbb{P}\big(z^{n}_{i+1|k} \mid x_{i+1}, l_n\big), \tag{24}$$







through the update of b[X_{i+1|k}] rather than through the update of b^{-}[X_{i+1|k+l}].


As discussed in Section 3.1, the only factors in Eq. (22) that do not require an update are (22)(a), which is identical to (23)(a), and (22)(d), which is re-used in its entirety; the rest might require some update. Under the assumption that both Eq. (22) and Eq. (23) share the same action sequence, the required update is restricted to the measurement factors.


In order to update the measurement factors of (22)(b)(c) to match (23)(b)(c), we start with matching their data association (DA), 𝒟_{k+1:i|k} and 𝒟_{k+1:i|k+l} respectively. As described in PCT application WO2019/171378, to the inventors of the present invention, this DA matching provides the indices of the factors whose DA should be updated, as well as the factors that should be added or removed from 𝒟_{k+1:i|k} in order to match 𝒟_{k+1:i|k+l}. The DA update process is done over the graphical representation of the belief, i.e. the factor graph. Once the DA update is complete, we are left with updating the measurement values of all the consistent DA factors. In other words, the update process can be viewed as having two steps: first, evaluating an information gap, which includes DA matching to determine measurements with DA to update, to add, or to remove; second, performing an update of the posterior belief, which comprises performing the corresponding DA modifications to generate updated or added measurements and responsively updating one or more corresponding measurement values. For the special case of Gaussian beliefs, see PCT application WO2019/171378, cited above. Once the update is complete, n beliefs are obtained representing b[X_{i+1|k+l}], each belief containing one of the n measurement samples {repSamples}_1^n.
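The two-step update can be illustrated with plain index bookkeeping. In the hypothetical sketch below, each measurement factor is keyed by its data association (e.g. a (state, landmark) pair); the dictionary representation stands in for the factor graph and is an assumption for illustration only.

```python
def da_gap(stored_da, new_da):
    """Step 1: compare the two DA sets and classify factor keys."""
    stored, new = set(stored_da), set(new_da)
    return {
        "keep":   stored & new,   # consistent DA: only measurement values may change
        "remove": stored - new,   # factors to remove from the stored belief
        "add":    new - stored,   # factors to add to match planning time k+l
    }

def apply_gap(stored_factors, gap, new_values):
    """Step 2: apply the DA modifications and refresh measurement values."""
    updated = {k: v for k, v in stored_factors.items() if k not in gap["remove"]}
    for k in gap["keep"] | gap["add"]:
        updated[k] = new_values[k]   # refresh (or create) the measurement value
    return updated
```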


3.4.4 Calculating Immediate Scores

Once we have a set of beliefs representing the possible futures of executing some action u_i at future time i+1 (Alg. 3 line 23), we need to calculate the reward or cost value resulting from each such belief (Alg. 3 line 24) as part of the effort to solve the expected objective function. In this section we address the process of obtaining the immediate scores (reward or cost values) for each of the aforementioned beliefs, whether they were entirely re-used from a different planning session, newly calculated, or a mixture of both. We conclude with the special form of a weighted immediate score (reward or cost value) function, and how it can benefit from the iX-BSP paradigm. As the immediate value calculation is sensitive to the origin of the belief, we cover the three possible scenarios: newly calculated beliefs, partially re-used/updated beliefs, and entirely re-used beliefs. Let us define some general immediate score (reward or cost value) function for planning horizon t





$$r_t : \mathbb{B} \times \mathbb{U} \to \mathbb{R}^{1}, \tag{25}$$


where 𝔹 and 𝕌 respectively denote the belief and action spaces. For a newly calculated belief, i.e. when no re-use has been made, the immediate score (reward or cost value) is calculated in the standard manner.

  • When the immediate value is a linear combination of some rewards or costs, we can perform re-use without recalculating each term; only the weights change, as in the sketch below.
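A minimal sketch of that re-use, assuming the immediate value is a plain weighted sum of cached reward terms (the numbers are invented for illustration):

```python
def reweighted_score(reward_terms, new_weights):
    """Re-use cached reward terms; only the weights change between sessions."""
    return sum(w * r for w, r in zip(new_weights, reward_terms))

# terms computed at planning time k, re-combined with weights for time k+l
cached_terms = [1.2, -0.4, 0.7]
print(reweighted_score(cached_terms, [0.5, 0.3, 0.2]))   # 0.62
```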


3.5 Incremental Expectation Calculation with Importance Sampling

Once we have obtained values for candidate actions along the planning horizon (Alg. 2, lines 6-8), we need to use them in order to estimate (16). Because we are selectively re-using samples from precursory planning sessions, we estimate (16) using samples not necessarily taken from ℙ(z_{k+l+1:i|k+l}|H_{k+l|k+l}, u_{k+l:i−1|k+l}); thus the formulation should be adjusted accordingly.


Applying the standard general formulation for the objective function, a simple example demonstrates iX-BSP and shows how the standard general formulation may be adjusted. We then characterize the problem as a Multiple Importance Sampling (MIS) problem, demonstrating how the same simple example is solved under an MIS formulation. We then provide a general formulation for the objective under MIS as part of iX-BSP.


Under the standard formulation of Eq. (16), the (n_x·n_z) measurements sampled per candidate step follow Alg. 1:










$$J(u) = \sum_{i=k+l+1}^{k+l+L}\left[\eta_{k+l+1} \sum_{\{z_{k+l+1|k+l}\}} \omega^{n}_{k+l+1}\left(\cdots\left(\eta_{i} \sum_{\{z_{i|k+l}\}} \omega^{n}_{i} \cdot r_{i}\big(b^{n}[X_{i|k+l}],\, u_{i-1|k+l}\big)\right)\right)\right], \tag{26}$$







where ω_i^n denotes the weight of the nth measurement sample for future time i, η_i denotes the normalizer of the weights at time i such that η_i^{-1} = Σ_{n=1}^{n_x·n_z} ω_i^n, and b^n[X_{i|k+l}] is the belief considering a specific set of samples up to future time i, i.e.





$$b^{n}[X_{i|k+l}] \doteq \mathbb{P}\big(X_{i} \mid H_{k+l|k+l},\, u_{k+l:i-1|k+l},\, \{z_{k+l+1|k+l}\}, \ldots, \{z_{i|k+l}\}\big).$$


When the measurements used to estimate the expectation are sampled from their nominal distributions, all weights equal 1, i.e. ω_i^n = 1 ∀i, n, and each normalizer equals the inverse of the number of samples, i.e. η_i^{-1} = n_x·n_z ∀i:










$$J(u) = \sum_{i=k+l+1}^{k+l+L}\left[\frac{1}{(n_x \cdot n_z)^{\,i-(k+l)}} \sum_{\{z_{k+l+1|k+l}\}} \cdots \sum_{\{z_{i|k+l}\}} r_{i}\big(b^{n}[X_{i|k+l}],\, u_{i-1|k+l}\big)\right]. \tag{28}$$







To better understand how the sampled distributions differ under iX-BSP, let us perform iX-BSP over a simple example. Assume we have access to all calculations from planning time k, in which we performed X-BSP (or iX-BSP) for a horizon of three steps ahead, with n_x=2 and n_z=1. FIG. 6A illustrates a specific action sequence, u_1→u_2→u_1, considered as part of planning at time k. Let us assume that the optimal action decided upon as part of planning at time k, and later executed, was u_1. We are currently at time k+1; after performing inference using the measurements we received as a result of executing u_1, we perform planning using iX-BSP with the same horizon length and number of samples per action, for several action sequences, one of which is the action sequence u_2→u_1→u_2, as illustrated in FIG. 6C.


Following Alg. 2 line 1, out of the two available beliefs from planning time k (FIG. 6A), {b[X_{k+1|k}]}_1^2, the left one is determined as closer to b[X_{k+1|k+1}], so we consider all its descendants as the set 𝔹_{k+1|k}, as illustrated in FIG. 6B, and denote the distance between b[X_{k+1|k}] and b[X_{k+1|k+1}] as Dist. Because Dist is determined as close enough for re-use, we can continue with re-using the beliefs in the set 𝔹_{k+1|k} (Alg. 2 line 6). First we check which of the two available sampled measurements from planning time k constitute an adequate representation of ℙ(z_{k+2}|H_{k+1|k+1}, u_2). One way to do so is following Alg. 5 and checking whether the two available state samples from planning time k constitute an adequate representation of b[X_{k+2|k+1}]; since they do, we consider all measurements associated with them as a representative set of ℙ(z_{k+2}|H_{k+1|k+1}, u_2).


Our representative set of measurement samples for look ahead step k+2 now holds two re-used measurements, so we update their corresponding beliefs {b[X_{k+2|k}]}_1^2 into {b[X_{k+2|k+1}]}_1^2 (Alg. 3 line 23); the updated beliefs are denoted by thick dashed ellipses in FIG. 6C. After updating the beliefs we can calculate/update the immediate scores associated with them (see Section 3.4.4); once these are obtained we can proceed to the next future time step.


For the next look ahead step, we propagate {b[X_{k+2|k+1}]}_1^2 with action u_1 to obtain {b^{-}[X_{k+3|k+1}]}_1^2 (Alg. 3 line 9), and check whether the four available measurement samples from planning time k constitute an adequate representation of ℙ(z_{k+3}|H_{k+2|k+1}, u_1) (Alg. 3 line 18); following Alg. 5 we find that only three of them do, so we mark the associated beliefs for re-use and sample the fourth measurement from the original distribution ℙ(z_{k+3}|H_{k+2|k+1}, u_1). We then update the beliefs we marked for re-use, {b[X_{k+3|k}]}_1^3 into {b[X_{k+3|k+1}]}_1^3 (denoted by the beliefs in thick dashed ellipses at k+3|k+1 in FIG. 6C), and b^{-}[X_{k+3|k+1}] into b^4[X_{k+3|k+1}] (denoted by the black colored belief at k+3|k+1 in FIG. 6C) using the newly sampled measurement (Alg. 3 line 23). After obtaining the beliefs for look ahead step k+3, whether through updating a re-used belief or calculating from scratch, we calculate/update the immediate scores of each. Since we do not have candidate beliefs to re-use for the next time step, the last step of the horizon, k+4|k+1, is calculated using X-BSP (Alg. 2 line 8).


At this point we have all the immediate scores for each of the predicted beliefs along the action sequence u_2→u_1→u_2, so we can calculate the expected reward (or cost) value of this action sequence for planning at time k+1. For the look ahead step k+2 of the planning session at time k+1, i.e. k+2|k+1, we have two immediate scores, {r_{k+2|k+1}(b[X_{k+2|k+1}], u_2)}_1^2, each calculated for a different belief b[X_{k+2|k+1}] considering a different sample z_{k+2|k}. Calculating the expected reward (cost) value for future time step k+2|k+1 would mean, in this case, using measurements sampled from ℙ(z_{k+2}|H_{k+1|k}, u_2) rather than from ℙ(z_{k+2}|H_{k+1|k+1}, u_2). This problem, of performing estimation using samples forced from a different distribution, is called importance sampling. Since for a single time step we might have samples from multiple different distributions, e.g. look ahead step k+3|k+1 in FIG. 6C, our problem falls within the special case of Multiple Importance Sampling (MIS) (see Appendix A).


The use of MIS enables us to calculate the expectation while sampling from a mixture of probabilities, where the balance heuristic is used to calculate the weight functions of the MIS. Using the formulation of MIS along with the balance heuristic presented in Eq. (41), we can write down the estimation of the expected reward value at look ahead step k+2|k+1,















$$\mathbb{E}\big[r_{k+2|k+1}(.)\big] \cong \frac{1}{2}\,\frac{p_{1}\big(z^{1}_{k+2|k}\big)}{\tfrac{2}{2}\, q_{1}\big(z^{1}_{k+2|k}\big)} \cdot r^{1}_{k+2|k+1}(.) \;+\; \frac{1}{2}\,\frac{p_{1}\big(z^{2}_{k+2|k}\big)}{\tfrac{2}{2}\, q_{1}\big(z^{2}_{k+2|k}\big)} \cdot r^{2}_{k+2|k+1}(.), \tag{29}$$







where p_1(.) ≐ ℙ(z_{k+2}|H_{k+1|k+1}, u_2) and q_1(.) ≐ ℙ(z_{k+2}|H_{k+1|k}, u_2).
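To make Eq. (29) concrete, the sketch below evaluates the two-sample estimate numerically for scalar Gaussian p_1 and q_1; the densities, sample values and rewards are invented for illustration.

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

p1 = lambda z: gauss(z, 0.1, 1.0)   # nominal likelihood at planning time k+1
q1 = lambda z: gauss(z, 0.0, 1.0)   # distribution the re-used samples came from

z = [0.3, -0.8]                     # the two re-used measurement samples
r = [2.0, 1.5]                      # immediate rewards of the updated beliefs

# Eq. (29): a single sampling distribution, so the balance-heuristic
# denominator is (2/2)*q1 = q1
estimate = sum(0.5 * p1(zi) / q1(zi) * ri for zi, ri in zip(z, r))
print(estimate)
```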


In the same manner, following (41), we can also write down the estimation for the expected reward(cost) value at look ahead step k+3 of planning at time k+1, i.e. k+3|k+1,














$$\mathbb{E}\big[r_{k+3|k+1}(.)\big] \cong \sum_{g=1}^{3} \frac{1}{4}\,\frac{\tilde{p}_{2}\big(z^{g}_{k+2:k+3|k}\big)}{\tfrac{3}{4}\,\tilde{q}_{2}\big(z^{g}_{k+2:k+3|k}\big) + \tfrac{1}{4}\,\tilde{p}_{2}\big(z^{g}_{k+2:k+3|k}\big)}\, r^{g}_{k+3|k+1}(.) \;+\; \frac{1}{4}\,\frac{\tilde{p}_{2}\big(z^{4}_{k+2:k+3|k+1}\big)}{\tfrac{3}{4}\,\tilde{q}_{2}\big(z^{4}_{k+2:k+3|k+1}\big) + \tfrac{1}{4}\,\tilde{p}_{2}\big(z^{4}_{k+2:k+3|k+1}\big)}\, r^{4}_{k+3|k+1}(.), \tag{30}$$







where p̃_2(.) ≐ ℙ(z_{k+2:k+3}|H_{k+1|k+1}, u_2, u_1) and q̃_2(.) ≐ ℙ(z_{k+2:k+3}|H_{k+1|k}, u_2, u_1). When considering (9), we can re-write the measurement likelihood from (30) as a product of measurement likelihoods per look ahead step, e.g. p̃_2(z^1_{k+2:k+3|k}) = p_1(z^1_{k+2|k})·p_2(z^1_{k+3|k}),
















$$\mathbb{E}\big[r_{k+3|k+1}(.)\big] \cong \sum_{g=1}^{3} \frac{1}{4}\,\frac{p_{1}\big(z^{g}_{k+2|k}\big)\, p_{2}\big(z^{g}_{k+3|k}\big)}{\tfrac{3}{4}\, q_{1}\big(z^{g}_{k+2|k}\big)\, q_{2}\big(z^{g}_{k+3|k}\big) + \tfrac{1}{4}\, p_{1}\big(z^{g}_{k+2|k}\big)\, p_{2}\big(z^{g}_{k+3|k}\big)}\, r^{g}_{k+3|k+1}(.) \;+\; \frac{1}{4}\,\frac{p_{1}\big(z^{4}_{k+2|k+1}\big)\, p_{2}\big(z^{4}_{k+3|k+1}\big)}{\tfrac{3}{4}\, q_{1}\big(z^{4}_{k+2|k+1}\big)\, q_{2}\big(z^{4}_{k+3|k+1}\big) + \tfrac{1}{4}\, p_{1}\big(z^{4}_{k+2|k+1}\big)\, p_{2}\big(z^{4}_{k+3|k+1}\big)}\, r^{4}_{k+3|k+1}(.), \tag{31}$$







where p1(.) need not be calculated at look ahead step k+3, since it is already given from (29).


We may now formulate, for the general case, the estimator for (16) based on the MIS problem:










$$J(u) \cong \sum_{i=k+l+1}^{k+l+L}\left[\sum_{m=1}^{M_{i}} \frac{1}{n_{m}} \sum_{g=1}^{n_{m}} \tilde{w}_{m}\big(z^{m,g}_{k+l+1:i}\big) \cdot \frac{\mathbb{P}\big(z^{m,g}_{k+l+1:i} \mid H_{k+l|k+l},\, u_{k+l:i-1|k+l}\big)}{q_{m}\big(z^{m,g}_{k+l+1:i}\big)} \cdot r_{i}\big(b[X_{i|k+l}],\, u_{i-1|k+l}\big)\right], \tag{32}$$







where M_i is the number of distributions at look ahead step i from which measurements are being sampled, n_m is the number of measurements sampled from the mth distribution at look ahead step i, q_m(.) is the mth distribution at look ahead step i from which samples were taken, and z^{m,g}_{k+l+1:i} is the gth set of future measurements at time instances k+l+1:i, sampled from the mth distribution. The mth weight is denoted by w̃_m, where Σ_m w̃_m = 1 and w̃_m > 0 ∀m. The estimator (32) is unbiased under the assumption that q_m(.) > 0 whenever w̃_m(.)·ℙ(z|H)·r_i(.) ≠ 0.


We may make use of an unbiased nearly optimal estimator for (16), based on the multiple importance sampling problem with the balance heuristic (see Appendix A)











$$J(u) \cong \sum_{i=k+l+1}^{k+l+L}\left[\frac{1}{n_{i}} \sum_{m=1}^{M_{i}} \sum_{g=1}^{n_{m}} w_{i}\big(z^{m,g}_{k+l+1:i}\big) \cdot r_{i}\big(b[X_{i|k+l}],\, u_{i-1|k+l}\big)\right], \tag{33}$$







where n_i is the number of samples considered at look ahead step i, M_i is the number of distributions at look ahead step i from which measurements are being sampled, n_m is the number of measurements sampled from the mth distribution, and, following the balance heuristic, w_i(z^{m,g}_{k+l+1:i}) is the likelihood ratio of the gth sample from the mth distribution at look ahead step i, given by












$$w_{i}\big(z^{m,g}_{k+l+1:i}\big) = \frac{\mathbb{P}\big(z^{m,g}_{k+l+1:i} \mid H_{k+l|k+l},\, u_{k+l:i-1|k+l}\big)}{\sum_{\tilde{m}=1}^{M_{i}} \frac{n_{\tilde{m}}}{n_{i}}\, q_{\tilde{m}}\big(z^{m,g}_{k+l+1:i}\big)}, \tag{34}$$







where z^{m,g}_{k+l+1:i} are the gth set of future measurements at time instances k+l+1:i, sampled from the mth distribution, and q_{m̃}(.) is the m̃th distribution.
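A direct transcription of the weight of Eq. (34) reads as follows; the density callables and argument conventions are assumptions for illustration.

```python
def balance_weight(z, nominal_pdf, qs, ns, n_i):
    """Eq. (34): balance-heuristic likelihood ratio of one sample.

    z           -- a sampled measurement sequence z_{k+l+1:i}
    nominal_pdf -- P(z | H_{k+l|k+l}, u_{k+l:i-1|k+l}), callable
    qs          -- the M_i sampling distributions, callables
    ns          -- samples drawn from each distribution, len(ns) == len(qs)
    n_i         -- total number of samples at this look-ahead step
    """
    denom = sum((n_m / n_i) * q(z) for q, n_m in zip(qs, ns))
    return nominal_pdf(z) / denom
```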


When all samples being considered to estimate (16) are sampled from their nominal distributions, (33) reduces back to the formulation of X-BSP, with all weights degenerating to ones; for such a case, M_i = 1 ∀i and q_1(⋅) = ℙ(z_{k+l+1:i}|H_{k+l|k+l}, u_{k+l:i−1|k+l}), thus w_i = 1 ∀i.


3.6 Incremental ML BSP

Since our novel approach changes the solution paradigm for the original, un-approximated problem (X-BSP), we claim it could also be utilized to reduce the computation time of existing approximations of X-BSP. To support this claim, in this section we present the implementation of iX-BSP principles over the most commonly used approximation, ML-BSP; we denote iX-BSP under the ML assumption as iML-BSP.


Under the ML assumption we consider just the most likely measurements, rather than sampling multiple measurements, hence Mi=1 ∀i, because a single measurement is considered for each action at each time step. Considering the aforementioned, for the case of iML-BSP, Eq. (33) is reduced to











$$J(u) \cong \sum_{i=k+l+1}^{k+l+L}\left[w_{i} \cdot r_{i}\big(b[X_{i|k+l}],\, u_{i-1|k+l}\big)\right], \tag{35}$$







and Eq. (34), representing the weight at time i, is reduced to











$$w_{i} = \frac{\mathbb{P}\big(z_{k+l+1:i} \mid H_{k+l|k+l},\, u_{k+l:i-1|k+l}\big)}{q\big(z_{k+l+1:i}\big)}, \tag{36}$$







where q(.) is the pdf from which the measurement set z_{k+l+1:i} was sampled.
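Under iML-BSP the weight thus collapses to a single likelihood ratio; a one-function sketch (density callables assumed):

```python
def iml_weight(z_seq, nominal_pdf, sampled_pdf):
    """Eq. (36): the single likelihood-ratio weight used by iML-BSP."""
    return nominal_pdf(z_seq) / sampled_pdf(z_seq)
```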


3.7 Full Inference and Planning Session


FIG. 7 illustrates a flowchart 700 of belief space planning through incremental expectation updating, summarizing the process described above (Sections 3.0-3.6), according to embodiments of the present invention. FIG. 7 indicates how the standard process of plan-act-measure-infer of BSP is modified by embodiments of the present invention. Instead of calculating BSP from scratch at each planning session, calculations are re-used from past experience (e.g., precursory planning sessions). The calculations to be re-used are stored for subsequent re-use at a preparation step 710. Stored calculations include previously evaluated beliefs, that is, propagated beliefs, together with subsequent belief nodes of planning horizons, including measurement samples. At a planning step 720 a planning session then proceeds with four key sub-steps, as follows.


First, from the set of stored propagated beliefs, a "closest" or "best matching" stored propagated belief is selected, as best corresponding to the current ("new") propagated belief from which planning is proceeding. Stored measurement samples associated with the closest propagated belief are then re-used, after ensuring that the measurement samples provide a representative set of the measurement likelihood distribution of the new propagated belief. Typically, the process of selecting the best match is iterative over the time steps of the planning horizon.


For each measurement sample stored with each best matching belief, an information gap is determined (the gap of data between the stored posterior belief associated with the re-use measurement sample and a new posterior belief that would be inferred by applying the re-use measurement sample to the new propagated belief). The posterior beliefs being re-used are then updated to account for the information gaps, and immediate scores for the posterior beliefs are calculated. If there is an insufficient number of re-usable measurement samples, new samples may be sampled as well, from which posterior beliefs are inferred by full calculation without re-use.


Subsequently, for multiple candidate action sequences, extending over the planning horizon, objective values are calculated as weighted sums including the immediate scores of the updated posterior beliefs. An optimal action from among the multiple candidate actions is then selected as the candidate action with the optimal objective value.


The updated and newly calculated beliefs of step 720 can then be added to the stored values of “past experience” of step 710. Then, at steps 730 through 750, the standard BSP processes of act (730), measure (740), and infer (750) are implemented, and the next planning session is then repeated.
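The modified cycle of FIG. 7 can be summarized schematically as follows; every function name below is a placeholder for the corresponding step described above, not a prescribed API.

```python
def session(belief, store, plan_ix_bsp, act, measure, infer):
    """One plan-act-measure-infer cycle with re-use of stored calculations."""
    action, new_calcs = plan_ix_bsp(belief, store)  # step 720: re-use step 710's store
    store.update(new_calcs)                         # feed calculations back into 710
    act(action)                                     # step 730
    z = measure()                                   # step 740
    return infer(belief, action, z), store          # step 750
```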


Appendix A: Multiple Importance Sampling

Let us assume we wish to express the expectation of some function f(x) with respect to distribution p(x), by sampling x from a different distribution q(x),












$$\mathbb{E}_{p}\big[f(x)\big] = \int f(x) \cdot p(x)\, dx = \int \frac{f(x) \cdot p(x)}{q(x)}\, q(x)\, dx = \mathbb{E}_{q}\left(\frac{f(x) \cdot p(x)}{q(x)}\right). \tag{37}$$
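Eq. (37) can be checked numerically in a few lines; the target p = N(0,1), proposal q = N(1,2²), and f(x) = x² below are arbitrary choices for illustration.

```python
import math, random

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

f = lambda x: x * x
random.seed(0)
xs = [random.gauss(1.0, 2.0) for _ in range(100_000)]   # x ~ q = N(1, 2^2)

# importance-sampling estimate of E_p[f(x)] per Eq. (37)
est = sum(f(x) * gauss_pdf(x, 0, 1) / gauss_pdf(x, 1, 2) for x in xs) / len(xs)
print(est)   # approaches E_{N(0,1)}[x^2] = 1
```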







Eq. (37) presents the basic importance sampling problem, where 𝔼_q denotes expectation for x ~ q(x). The probability ratio between the nominal distribution p and the importance sampling distribution q is usually referred to as the likelihood ratio. Our problem is more complex, since our samples are potentially taken from M different distributions, with M ∈ [1, (n_x·n_z)^L], i.e. a multiple importance sampling problem













$$\mathbb{E}_{p}\big[f(x)\big] \cong \tilde{\mu}(x) \doteq \sum_{m=1}^{M} \frac{1}{n_{m}} \sum_{i=1}^{n_{m}} w_{m}\big(x_{i}^{m}\big)\, \frac{f\big(x_{i}^{m}\big)\, p\big(x_{i}^{m}\big)}{q_{m}\big(x_{i}^{m}\big)}, \tag{38}$$







where w_m(.) are weight functions satisfying Σ_{m=1}^{M} w_m(x) = 1. For q_m(x) > 0 whenever w_m(x)·p(x)·f(x) ≠ 0, Eq. (38) forms an unbiased estimator













$$\mathbb{E}\big[\tilde{\mu}(x)\big] = \sum_{m=1}^{M} \mathbb{E}_{q_{m}}\left[\frac{1}{n_{m}} \sum_{i=1}^{n_{m}} w_{m}\big(x_{i}^{m}\big)\, \frac{f\big(x_{i}^{m}\big)\, p\big(x_{i}^{m}\big)}{q_{m}\big(x_{i}^{m}\big)}\right] = \mathbb{E}_{p}\big[f(x)\big]. \tag{39}$$







Although there are numerous options for weight functions satisfying Σ_{m=1}^{M} w_m(x) = 1, we apply here the balance heuristic, considered to be nearly optimal in the sense of estimation variance,












$$w_{m}(x) = w_{m}^{BH}(x) = \frac{n_{m}\, q_{m}(x)}{\sum_{s=1}^{M} n_{s}\, q_{s}(x)}. \tag{40}$$







Using (40) in (38) produces the multiple importance sampling estimator with the balance heuristic












$$\mathbb{E}_{p}\big[f(x)\big] \cong \frac{1}{n} \sum_{m=1}^{M} \sum_{i=1}^{n_{m}} \frac{p\big(x_{i}^{m}\big)}{\sum_{s=1}^{M} \frac{n_{s}}{n}\, q_{s}\big(x_{i}^{m}\big)}\, f\big(x_{i}^{m}\big). \tag{41}$$
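A compact sketch of the estimator of Eq. (41); p, the q_m and f are density/function callables, and the whole sketch is illustrative rather than normative.

```python
def mis_balance_estimate(samples_per_q, p, qs, f):
    """Eq. (41): multiple importance sampling with the balance heuristic.

    samples_per_q -- one list of samples per sampling distribution q_m
    """
    ns = [len(s) for s in samples_per_q]
    n = sum(ns)
    total = 0.0
    for samples in samples_per_q:
        for x in samples:
            denom = sum((n_s / n) * q(x) for q, n_s in zip(qs, ns))  # Eq. (40)
            total += p(x) / denom * f(x)
    return total / n
```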







Appendix B: The DPQ Distance

The approach described herein requires a probability density function (pdf) metric. One possibility is to use the D_PQ metric, which was first suggested as the D_PQ2 distance in F. Topsoe, "Some inequalities for information divergence and related measures of discrimination," IEEE Transactions on Information Theory, 46(4):1602-1609, 2000, incorporated herein by reference. Application of this pdf distance as a metric was described in Endres, cited above.

  • The DPQ metric is given by












$$D_{PQ}(P, Q) = \frac{1}{2}\, D_{KL}(P \| Q) + \frac{1}{2}\, D_{KL}(Q \| P), \tag{42}$$







where P and Q are probability density functions and DKL(P∥Q) is the Kullback-Leibler (KL) divergence.

  • The Kullback-Leibler (KL) divergence, sometimes referred to as relative entropy, measures how well some distribution Q approximates distribution P, or in other words how much information would be lost if one considered distribution Q instead of P. The KL divergence is not a metric (it is asymmetric) and is given by











$$D_{KL}(P \| Q) = \int P \cdot \log \frac{P}{Q} = \mathbb{E}_{P}\big[\log P - \log Q\big]. \tag{43}$$







From the viewpoint of Bayesian inference, the D_KL(P∥Q) divergence can be interpreted as twice the expected information gain when deciding between P and Q given a uniform prior over them.

  • For the special case of Gaussian distributions, we can express D_KL(P∥Q), and consequently D_PQ(P, Q), in terms of means and covariances. Let us consider two multivariate Gaussians P and Q in ℝ^d, with means μ_p and μ_q and covariances Σ_p and Σ_q, respectively.











$$\begin{aligned}
D_{KL}(P \| Q) &= \mathbb{E}_{P}\big[\log P - \log Q\big] \\
&= \frac{1}{2}\, \mathbb{E}_{P}\Big[-\log\left|\Sigma_{p}\right| - (x-\mu_{p})^{T} \Sigma_{p}^{-1}(x-\mu_{p}) + \log\left|\Sigma_{q}\right| + (x-\mu_{q})^{T} \Sigma_{q}^{-1}(x-\mu_{q})\Big] \\
&= \frac{1}{2} \log\frac{\left|\Sigma_{q}\right|}{\left|\Sigma_{p}\right|} + \frac{1}{2}\, \mathbb{E}_{P}\Big[-\mathrm{tr}\big(\Sigma_{p}^{-1}(x-\mu_{p})(x-\mu_{p})^{T}\big) + \mathrm{tr}\big(\Sigma_{q}^{-1}\big(x x^{T} - 2 x \mu_{q}^{T} + \mu_{q}\mu_{q}^{T}\big)\big)\Big] \\
&= \frac{1}{2} \log\frac{\left|\Sigma_{q}\right|}{\left|\Sigma_{p}\right|} - \frac{1}{2}\, d + \frac{1}{2}\, \mathrm{tr}\Big(\Sigma_{q}^{-1}\big(\Sigma_{p} + \mu_{p}\mu_{p}^{T} - 2\mu_{p}\mu_{q}^{T} + \mu_{q}\mu_{q}^{T}\big)\Big) \\
&= \frac{1}{2}\left[\log\frac{\left|\Sigma_{q}\right|}{\left|\Sigma_{p}\right|} - d + \mathrm{tr}\big(\Sigma_{q}^{-1}\Sigma_{p}\big) + (\mu_{p}-\mu_{q})^{T} \Sigma_{q}^{-1}(\mu_{p}-\mu_{q})\right]
\end{aligned} \tag{44}$$







Substituting Eq. (44) in Eq. (42) we get the DPQ metric representation for the multivariate Gaussian case,











$$\begin{aligned}
D_{PQ}(P, Q) &= \frac{1}{2}\left[\frac{1}{2}\left(\log\frac{\left|\Sigma_{q}\right|}{\left|\Sigma_{p}\right|} - d + \mathrm{tr}\big(\Sigma_{q}^{-1}\Sigma_{p}\big) + (\mu_{p}-\mu_{q})^{T}\Sigma_{q}^{-1}(\mu_{p}-\mu_{q})\right)\right. \\
&\qquad \left. + \frac{1}{2}\left(\log\frac{\left|\Sigma_{p}\right|}{\left|\Sigma_{q}\right|} - d + \mathrm{tr}\big(\Sigma_{p}^{-1}\Sigma_{q}\big) + (\mu_{p}-\mu_{q})^{T}\Sigma_{p}^{-1}(\mu_{p}-\mu_{q})\right)\right] \\
&= \frac{1}{4}\left[(\mu_{p}-\mu_{q})^{T}\big[\Sigma_{q}^{-1}+\Sigma_{p}^{-1}\big](\mu_{p}-\mu_{q}) + \mathrm{tr}\big(\Sigma_{q}^{-1}\Sigma_{p}\big) + \mathrm{tr}\big(\Sigma_{p}^{-1}\Sigma_{q}\big) - 2d\right].
\end{aligned} \tag{45}$$
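For the Gaussian case the above reduces to a handful of linear-algebra calls. The numpy sketch below implements Eq. (44) and obtains Eq. (45) as the symmetrized average; it assumes the reconstructed forms given above.

```python
import numpy as np

def kl_gauss(mu_p, cov_p, mu_q, cov_q):
    """Eq. (44): KL divergence between multivariate Gaussians P and Q."""
    d = mu_p.size
    iq = np.linalg.inv(cov_q)
    diff = mu_p - mu_q
    return 0.5 * (np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p))
                  - d + np.trace(iq @ cov_p) + diff @ iq @ diff)

def d_pq(mu_p, cov_p, mu_q, cov_q):
    """Eq. (45): D_PQ as the average of the two KL directions; the
    log-determinant terms cancel, matching the closed form above."""
    return 0.5 * (kl_gauss(mu_p, cov_p, mu_q, cov_q)
                  + kl_gauss(mu_q, cov_q, mu_p, cov_p))
```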

















APPENDIX C








Algorithms




















Algorithm 1 Sampling z_{i|k} ~ ℙ(z_{i|k}|H_{i|k})
 1: χ_i = {x_i, ℒ} ~ ℙ(x_i, ℒ|H_{i|k})
 2: Determine data association 𝒟_{i|k}(x_i, ℒ)
 3: z_{i|k} = {z_{i,j|k}}_{j∈𝒟_{i|k}(χ_i)} with z_{i,j|k} ~ ℙ(z_{i,j|k}|x_i, l_j)
 4: return z_{i|k} and χ_i
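A Python rendering of Alg. 1's two-stage sampling, for a Gaussian propagated belief and an additive-noise measurement model; the data-association step (line 2) is omitted and all names are assumptions for illustration.

```python
import numpy as np

def sample_measurements(mu, cov, h, noise_cov, rng, n_x=1, n_z=1):
    """Alg. 1 flavor: sample n_x states from the propagated belief, then
    n_z measurements from the measurement model of each sampled state."""
    xs, zs = [], []
    for _ in range(n_x):
        x = rng.multivariate_normal(mu, cov)                     # state sample (line 1)
        xs.append(x)
        for _ in range(n_z):
            zs.append(rng.multivariate_normal(h(x), noise_cov))  # z ~ P(z|x) (line 3)
    return zs, xs

# usage: zs, xs = sample_measurements(mu, cov, h, R, np.random.default_rng(0), 2, 1)
```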






















Algorithm 2 iX-BSP: Planning time k+l
Input:
  data — calculations used for the precursory planning session
  b[X_{k+l|k+l}] — the up-to-date inference posterior for time k+l
 1: Dist, b̃[X_{k+l|k}] ← SELECTCLOSESTBRANCH(b[X_{k+l|k+l}], data)  ▷ see Section 3.3
 2: if Dist ≤ ϵ_c then  ▷ belief distance threshold ϵ_c
 3:   if useWF ∩ (Dist ≤ ϵ_wf) then  ▷ wildfire threshold ϵ_wf
 4:     data ← 𝔹_{k+l|k}  ▷ re-use the entire selected branch without any update
 5:   else
 6:     data ← REUSEEXISTINGBELIEFS(b̃[X_{k+l|k}])  ▷ see Section 3.4
 7:   end if
 8:   data ← perform X-BSP over horizon steps k+L+1 : k+L+l
 9:   Solve Eq. (11) for each candidate action  ▷ see Section 3.5
10:   u_{k+l:k+L|k+l} ← find best action
11: else
12:   u_{k+l:k+L|k+l} ← perform X-BSP(b[X_{k+l|k+l}])
13: end if
14: return u_{k+l:k+L|k+l}, data





















Algorithm 3 reuseExistingBeliefs
Input:
  b̃[X_{k+l|k}] — the root node of the selected branch, see Section 3.3
 1: for each i ∈ [k+l, k+L−1] do  ▷ each overlapping horizon step
 2:   for each s ∈ [1, n_u·(n_x·n_z)^{i−k−l}] do  ▷ each belief in the ith horizon step
 3:     if useWF ∩ ISWILDFIRE(b^s[X_{i|k+l}]) then
 4:       r_0 ← (s−1)·n_u·(n_x·n_z)
 5:       {b^r[X_{i+1|k+l}]}_{r=r_0+1}^{r_0+n_u·(n_x·n_z)} ← all first order children of b^{s′}[X_{i|k}]  ▷ wildfire condition
 6:       mark all {b^r[X_{i+1|k+l}]}_{r=r_0+1}^{r_0+n_u·(n_x·n_z)} as wildfire
 7:     else
 8:       for each candidate action a ∈ [1, n_u] do
 9:         b_a^{s−}[X_{i+1|k+l}] ← propagate b^s[X_{i|k+l}] with candidate action a
10:         dist, b_a^{s′−}[X_{i+1|k}] ← CLOSESTBELIEF(𝔹_{k+l|k}, b_a^{s−}[X_{i+1|k+l}])  ▷ see Section 3.4.1
11:         if dist ≤ ϵ_c then  ▷ re-use condition
12:           if useWF ∩ (dist ≤ ϵ_wf) then  ▷ wildfire condition
13:             {b_a^r[X_{i+1|k+l}]}_{r=1}^{n_x·n_z} ← all first order children of b_a^{s′−}[X_{i+1|k}]
14:             mark {b_a^r[X_{i+1|k+l}]}_{r=1}^{n_x·n_z} as wildfire
15:             continue with next candidate action (i.e. jump to line 8)
16:           else
17:             samples ← all samples taken from b_a^{s′−}[X_{i+1|k}]
18:             {repSamples}_1^{n_x·n_z}, data ← ISREPSAMPLE(samples, b_a^{s−}[X_{i+1|k+l}])  ▷ see Section 3.4.2
19:           end if
20:         else  ▷ not computationally effective to re-use, resample all
21:           {repSamples}_1^{n_x·n_z}, data ← (n_x·n_z) fresh samples based on b_a^{s−}[X_{i+1|k+l}]  ▷ see Alg. 1
22:         end if
23:         data ← UPDATEBELIEF(dist, {repSamples}_1^{n_x·n_z}, data)  ▷ see Section 3.4.3
24:         data ← update objective value for a  ▷ see Section 3.4.4
25:       end for
26:     end if
27:   end for
28: end for
29: return data





















Algorithm 4 ClosestBelief
Input:
  𝔹_{k+l|k} — set of candidate beliefs for re-use from planning at time k, see Section 3.3
  b[X_{i+1|k+l}] — the belief to check distance to, from planning at time k+l
 1: δ_min = ∞
 2: for b[X_{i+1|k}] ∈ 𝔹_{k+l|k} do
 3:   δ ← BELIEFMETRIC(b[X_{i+1|k}], b[X_{i+1|k+l}])  ▷ probability metric to determine belief distance
 4:   if δ ≤ δ_min then  ▷ keeping track of the shortest distance
 5:     δ_min ← δ
 6:     b′[X_{i+1|k}] ← b[X_{i+1|k}]
 7:   end if
 8: end for
 9: return δ_min, b′[X_{i+1|k}]





















Algorithm 5 IsRepSample
Input:
  samples — set of candidate samples for re-use from planning at time k, see Alg. 3 line 17
  b[X_{i|k+l}] — the belief from planning time k+l that the samples should be representing, i.e. sampled from
 1: Given β_σ = 1.5  ▷ user determined heuristic, in direct proportion to acceptance
 2: for each sample ∈ samples do
 3:   if sample ⊂ ±β_σ·σ of b[X_{i|k+l}] then
 4:     flag sample for re-use  ▷ the sample falls within the ±β_σ·σ range, hence accepted
 5:   else  ▷ the sample falls outside the ±β_σ·σ range, hence rejected
 6:     re-sample measurement using b[X_{i|k+l}]  ▷ freshly sampled, see Alg. 1
 7:   end if
 8: end for
 9: return {repSamples}_1^n, {q(.)}_1^n  ▷ {q(.)}_1^n represent the PDFs from which the {repSamples}_1^n were taken








Claims
  • 1. A system of decision making under uncertainty, for selecting an optimal action from among multiple candidate actions, the system comprising a processor and an associated memory storing instructions, which when executed by the processor implement steps comprising: A) from a current belief, inferring multiple new propagated beliefs according to the respective multiple candidate actions; B) accessing a stored set of propagated beliefs, wherein each given stored propagated belief is associated with one or more stored measurement samples and with one or more respective stored posterior beliefs inferred from the given stored propagated belief according to a respective measurement sample, wherein the stored propagated beliefs were propagated from prior beliefs during one or more precursory planning sessions at one or more respective previous times; C) for each new propagated belief generated for a respective candidate action: selecting from the set of stored propagated beliefs a closest stored propagated belief; selecting, from among the one or more stored measurement samples associated with the closest stored propagated belief, re-use measurement samples for a representative set of a measurement likelihood distribution corresponding to the new propagated belief; D) for each re-use measurement sample: determining an information gap between the stored posterior belief associated with the re-use measurement sample and a new posterior belief that would be inferred by applying the re-use measurement sample to the new propagated belief; responsively updating the stored posterior belief to account for the information gap and associating the new propagated belief with the updated posterior belief; calculating an immediate score for the updated posterior belief associated with the new propagated belief; E) subsequently, calculating objective values for the multiple candidate actions, wherein the calculation of the objective values is a weighted summation including the immediate scores of the updated posterior beliefs; and F) subsequently, determining the optimal action from among the multiple candidate actions according to the candidate action with the optimal objective value.
  • 2. The system of claim 1, further comprising newly sampling one or more additional measurement samples for one or more of the new propagated beliefs, inferring from the one or more additional measurement samples respective one or more additional posterior beliefs, calculating additional immediate scores for the one or more additional posterior beliefs; and wherein the calculation of the objective values is a summation including the immediate scores of the updated posterior beliefs and of the additional posterior beliefs.
  • 3. The system of claim 2, wherein, selecting re-use measurement samples for the representative set comprises determining that the re-use measurement samples are an inadequate representative measurement set and that adding the additional measurement samples provides an adequate representative measurement set.
  • 4. The system of claim 1, wherein selecting the re-use measurement samples for the representative set comprises determining which measurement samples are within a pre-determined variance range of the measurement likelihood distribution.
  • 5. The system of claim 1, further comprising determining that the information gap is less than a wildfire threshold and wherein updating the stored posterior belief to account for the information gap comprises making no updating calculations of the stored posterior belief.
  • 6. The system of claim 5, wherein the candidate actions are sequences of actions over a planning horizon, which generate respective sequences of propagated beliefs branching from the new propagated belief, wherein the stored propagated beliefs include planning horizons including stored branches of stored propagated beliefs and associated measurements, and wherein the steps further include, upon determining that the information gap is less than a wildfire threshold, associating a stored branch of the closest stored propagated belief with the new propagated belief, with no updating of stored posterior beliefs of the stored branch.
  • 7. The system of claim 1, wherein the stored propagated beliefs include planning horizons including stored branches of stored propagated beliefs and associated measurements, wherein a stored branch of the closest propagated belief has a planning horizon of L1, wherein a time of the closest propagated belief is k, such that the last posterior beliefs of the stored branch are associated with a time k+L1, wherein the multiple candidate actions have a planning horizon of L2, wherein a time of the new propagated belief is k+1, and wherein the steps of the system further comprise: sampling new measurements between times k+L1 and k+L2+l; responsively inferring additional posterior beliefs; calculating for the additional posterior beliefs respective additional immediate scores, and calculating the objective values by a weighted summation incorporating the additional immediate scores.
  • 8. The system of claim 1, wherein the immediate score reflects a cost that is a function of the posterior belief and the candidate action, and wherein the optimal objective value is a minimum objective value of the candidate actions.
  • 9. The system of claim 1, wherein the immediate score reflects a reward that is a function of the posterior belief and the candidate action, and wherein the optimal objective value is a maximum objective value of the candidate actions.
  • 10. The system of claim 1, wherein the one or more measurement samples associated with each propagated belief space are one sampled measurement that is selected as the maximum likelihood measurement.
  • 11. The system of claim 1, wherein determining the information gap comprises performing data association (DA) matching to determine measurements with DA to update, to add or to remove, and wherein updating the stored posterior belief comprises performing corresponding DA modifications to generate updated or added measurements and responsively to update one or more corresponding measurement values.
  • 12. A method of decision making under uncertainty, for selecting an optimal action from among multiple candidate actions, the method implemented by a processor and an associated memory storing instructions, which when executed by the processor implement steps of the method comprising: A) from a current belief, inferring multiple new propagated beliefs according to the respective multiple candidate actions; B) accessing a stored set of propagated beliefs, wherein each given stored propagated belief is associated with one or more stored measurement samples and with one or more respective stored posterior beliefs inferred from the given stored propagated belief according to a respective measurement sample, wherein the stored propagated beliefs were propagated from prior beliefs during one or more precursory planning sessions at one or more respective previous times; C) for each new propagated belief generated for a respective candidate action: selecting from the set of stored propagated beliefs a closest stored propagated belief; selecting, from among the one or more stored measurement samples associated with the closest stored propagated belief, re-use measurement samples for a representative set of a measurement likelihood distribution corresponding to the new propagated belief; D) for each re-use measurement sample: determining an information gap between the stored posterior belief associated with the re-use measurement sample and a new posterior belief that would be inferred by applying the re-use measurement sample to the new propagated belief; responsively updating the stored posterior belief to account for the information gap and associating the new propagated belief with the updated posterior belief; calculating an immediate score for the updated posterior belief associated with the new propagated belief; E) subsequently, calculating objective values for the multiple candidate actions, wherein the calculation of the objective values is a weighted summation including the immediate scores of the updated posterior beliefs; and F) subsequently, determining the optimal action from among the multiple candidate actions according to the candidate action with the optimal objective value.
  • 13. The method of claim 12, further comprising newly sampling one or more additional measurement samples for one or more of the new propagated beliefs, inferring from the one or more additional measurement samples respective one or more additional posterior beliefs, calculating additional immediate scores for the one or more additional posterior beliefs; and wherein the calculation of the objective values is a summation including the immediate scores of the updated posterior beliefs and of the additional posterior beliefs.
  • 14. The method of claim 12, wherein selecting the re-use measurement samples for the representative set comprises determining which measurement samples are within a pre-determined variance range of the measurement likelihood distribution.
  • 15. The method of claim 12, wherein the candidate actions are sequences of actions over a planning horizon, which generate respective sequences of propagated beliefs branching from the new propagated belief, wherein the stored propagated beliefs include planning horizons including stored branches of stored propagated beliefs and associated measurements, and wherein the steps further include, upon determining that the information gap is less than a wildfire threshold, associating a stored branch of the closest stored propagated belief with the new propagated belief, with no updating of stored posterior beliefs of the stored branch.
  • 16. The method of claim 12, wherein the immediate score reflects a cost that is a function of the posterior belief and the candidate action, and wherein the optimal objective value is a minimum objective value of the candidate actions.
  • 17. The method of claim 12, wherein the immediate score reflects a reward that is a function of the posterior belief and the candidate action, and wherein the optimal objective value is a maximum objective value of the candidate actions.
  • 18. The method of claim 12, wherein the one or more measurement samples associated with each propagated belief space are one sampled measurement that is selected as the maximum likelihood measurement.
  • 19. The method of claim 12, wherein determining the information gap comprises performing data association (DA) matching to determine measurements with DA to update, to add or to remove, and wherein updating the posterior belief comprises performing corresponding DA modifications to generate updated or added measurements and responsively to update one or more corresponding measurement values.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 62/831,812, filed Apr. 10, 2019, the contents of which are incorporated herein by reference.

Provisional Applications (1)
Number Date Country
62831812 Apr 2019 US