The present invention relates to an estimation method, an estimation apparatus, and a program.
These days, with the development of sensors and information communication technologies, various data have been collected on a large scale; however, it is often the case that not individual data but aggregated data alone is available due to consideration for privacy, difficulty in observation, etc. For example, human position information obtained by observing radio waves from a GPS (Global Positioning System) satellite may be provided as time-based area population data in which individuals cannot be tracked in consideration for privacy. The time-based area population data is information indicating the number of people in each area at each time step.
A model called a CGM (Collective Graphical Model) is known and widely used as a model with which even in a situation where only thus aggregated data (hereinafter, also referred to as aggregate data) is available, deeper information can be extracted from the data by performing interpolation, estimation, and learning on the basis of probabilistic modeling (see, for example, Non Patent Literature 1).
Among operations in the CGM, there is an operation in which when observed aggregate data and potentials of a graphical model present behind it are given, a true contingency table of the graphical model is estimated by MAP estimation (maximum a posteriori estimation). The MAP estimation is essential for interpolation and estimation, and is a very important operation used also as a subroutine of learning; thus, has been variously studied heretofore.
Non Patent Literature 1: D. R. Sheldon and T. G. Dietterich. Collective Graphical Models. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 1161-1169, 2011
Meanwhile, when estimating a true contingency table of a graphical model by MAP estimation, it is necessary to solve a MAP estimation problem. At this time, conventionally, a technique of solution by applying Stirling's approximation and continuous relaxation has been mainly used. However, this technique may have low solution accuracy or low solution interpretability.
An embodiment of the present invention has been made in view of the above points, and an object of the present invention is to obtain a MAP estimation solution with high accuracy and high interpretability.
Aiming to achieve the above object, an estimation method according to an embodiment is an estimation method that estimates a MAP solution of a CGM on a path graph, in which a computer executes: an input procedure that receives, as inputs, aggregate data and potentials of the CGM on the path graph; an estimation procedure that uses the aggregate data and the potentials to solve a MAP estimation problem of the CGM by a technique based on discrete DC programming and calculates a MAP estimation solution; and an output procedure that outputs the MAP estimation solution.
A MAP estimation solution with high accuracy and high interpretability can be obtained.
Hereinbelow, an embodiment of the present invention is described. In the present embodiment, a MAP estimation apparatus 10 is described with which when observed aggregate data and potentials of a graphical model are given and it is attempted to estimate a true contingency table of the graphical model by MAP estimation, a MAP estimation solution with high accuracy and high interpretability can be obtained.
<CGM and MAP Estimation>
First, a CGM and MAP estimation are briefly described.
Letting H=(N, A) be an undirected graph, a graphical model expressed by a probability mass function like below will now be considered. Note that N is a set of vertices, and A is a set of edges.
where φij(xi, xj|θ) is a local potential defined for variables (Xi, Xj). The local potential is controlled by a parameter vector θ, and Z(θ) is a normalization constant (partition function). It is assumed that the random variable Xi takes a value in a finite set
xi [Math. 2]
Samples from this graphical model are represented by x(1), . . . , x(M). A contingency table ni regarding a vertex i and a contingency table nij regarding an edge (i, j) are defined as follows, respectively.
provided that
(⋅) [Math. 5]
is an indicator function. Further, a vector in which the contingency table ni regarding the vertex i and the contingency table nij regarding the edge (i, j) are collected and arranged for all the vertices and all the edges is represented by n.
At this time, a distribution of n (this is referred to as a CGM distribution) can be written as
where νi is the degree of the vertex i.
Further, it is assumed that an observation value y (that is, observed aggregate data) is generated from some probability distribution p(y|n) indicating observation noise. Specifically, it is assumed that the value ni is observed at each vertex in accordance with the following observation noise model.
provided that it is assumed that
−log pi,x
is a convex function with respect to ni(xi). A posterior distribution of n is given by p(n|y; θ)∝p(n; θ)·p(y|n). Then, a MAP estimation problem is expressed as maxnp(n|y; θ).
The above MAP estimation problem will now be modified. When the MAP estimation problem is regarded as a minimization problem of −log p(n|y; θ), the following minimization problem is to be solved.
Hereinafter, for simplicity, it is assumed that, for i=1, 2, . . . , |N|,
x
i={1, 2, . . . , R}. [Math. 11]
Note that this is an example, and the following description can be easily extended to other cases.
The MAP estimation problem shown in Math. 9 above is generally considered to be NP-hard, and is therefore very difficult to solve efficiently. For this reason, conventionally, a technique in which Stirling's approximation and continuous relaxation (that is, removal of the restriction of taking only integer values) are applied to the objective function and message passing is used to make solution, or the like has been mainly used. The continuously relaxed problem is a convex programming problem, and a global optimum solution can be obtained by message passing (see, for example, Reference 1).
Reference 1: T. Sun, D. R. Sheldon and A. Kumar. Message Passing for Collective Graphical Model. In Proceedings of the 32nd International Conference on Machine Learning, pp. 853-861, 2015.
However, the conventional technique described above has the following two problems of (1) and (2).
log x!≠x log x−x [Math. 12]
is not accurate when x is small.
Thus, in the present embodiment, a case is described where the MAP estimation of a CGM in a path graph is brought down to a minimum cost flow problem on a network and discrete DC (Difference of Convex) programming is applied to this problem to efficiently solve the estimation of n without approximation or continuous relaxation. Thereby, the MAP estimation apparatus 10 according to the present embodiment can output an accurate solution even when the total number of samples M is small, and can output a sparse solution (that is, a solution with high interpretability). The CGM in a path graph is important because it can express a time-series model (Markov model) and therefore has a wide application range.
Note that the discrete DC programming is a technique of efficiently optimizing a function expressed in the form of a difference between two discrete convex functions (see, for example, Reference 2).
Reference 2: T. Maehara, K. Murota, A framework of discrete DC programming by discrete convex analysis. Mathematical Programming, Series A, vol. 152, no. 1-2, pp. 435-466.
<Theoretical Configuration of MAP Estimation According to the Present Embodiment>
Next, a theoretical configuration when the MAP estimation apparatus 10 according to the present embodiment performs MAP estimation is described. As described above, the MAP estimation apparatus 10 according to the present embodiment brings the MAP estimation of a CGM in a path graph down to a minimum cost flow problem on a network, and then applies discrete DC programming to this problem to make solution; thus, obtains a MAP estimation solution.
First, the minimum cost flow problem is described. The (non-linear) minimum cost flow problem is a problem like below. A directed graph G=(V, E) is given as an input, and for each edge (i, j)∈E,
Capacity constraint uij∈≥0
Cost function cij:≥0→ [Math. 13]
are allocated. Further, for each vertex i∈V,
Demand bi∈≥0 [Math. 14]
is given. The minimum cost flow problem is a problem of obtaining a flow having the minimum cost among flows satisfying the capacity constraint of each edge and the demand constraint of each vertex.
That is, when the flow flowing through the edge (i, j)∈E is represented by xij, the minimum cost flow problem can be formulated as follows.
Here, when the cost function satisfies a convex cost condition of Cij(x+2)+cij(x)≥2·cij(x+1) in all the edges (i, j)∈E, the minimum cost flow problem shown in Math. 15 above is called a minimum convex cost flow problem, and is known to have an efficient solution method.
Next, a method for creating an instance of a minimum cost flow problem is described. For ease of writing, the notation of
n
ijk
:=n
i, i+1(j, k)
φijk:=φi, i+1(j, k)
n
i
:=n
i(j)
y
i
:=y
i(j)
is used. When the graphical model is of a path type,
and thus the objective function is
Further, letting
f
ijk(z):=log z!−z·log φijk
g(z):=−log z!
h
ij(z):=−log [pij(yij|z)],
the objective function can be rewritten as
In order to turn the optimization problem estimation problem) shown in Math. 9 above into a minimum cost flow problem, a minimum cost flow problem on the graph G=(V, E) is constructed by (a) to (g) below. Note that in the following description, the edge (c, u) represents an edge in which the cost function is c and the capacity constraint is u. Further, [m]=(1, 2, . . . , m) is given for any natural number m.
V={o}∪((1)∪(1))∪((2)∪(2)). . . ∪((|N|)∪(|N|))∪{d} [Math. 19]
where
(i)
:={u
j
(i)}j=1R; (i):={vj(i)}j=1R [Math. 20].
Further, o is the vertex forming the start point of the flow, and d is the vertex forming the end point of the flow.
v∈V\{o, d} [Math. 21]
As an example, an instance of a minimum cost flow problem constructed by (a) to (g) above when |N|=3 and R=3 is shown in
Then, from an optimum solution of a minimum cost flow problem constructed by (a) to (g) above, a solution n* of the optimization problem shown in Math. 9 above is composed as follows.
nijk*=(Flow rate through an edge from vertex vj(i) to vertex uk(i+1))
nij*=(Flow rate through an edge from vertex uj(i) to vertex vj(i)) [Math. 22]
At this time, n* is an optimum solution of the optimization problem shown in Math. 9 above. Thus, it can be seen that the MAP estimation of a CGM can be made by solving a minimum cost flow problem constructed by (a) to (g) above.
Although the minimum cost flow problem constructed by (a) to (g) above does not satisfy the convex cost condition, the minimum cost flow problem has a feature that the cost function of the edge has succeeded in being explicitly separated into a convex function and a concave function. Utilizing this, the minimum cost flow problem constructed by (a) to (g) above will now be solved by a technique based on discrete DC programming.
The minimum cost flow problem constructed by (a) to (g) above can be formulated as an optimization problem like below.
[Math. 25]
is a set of feasible integer solutions of the minimum cost flow problem constructed by (a) to (g) above.
In order to solve the optimization problem shown in Math. 23 above, in a row of feasible solutions zo, z1, . . . , zt, . . . , solutions satisfying
(z0)≥(z1)≥ . . . ≥(zt) . . . [Math. 26]
will now be generated. The initial point
z0∈ [Math. 27]
is arbitrarily taken. When z0, z1, . . . , zt are already determined, for
(Condition 1) (zt)=(zt)
(Condition 2) (z)≥(z), ∀z∈
(Condition 3) t(z):=(z)+(z) [Math. 28]
can be efficiently minimized in N,
a function satisfying all of these,
[Math. 29]
is found. Then, based on
a new point zt+1 is determined. At this time,
holds; thus, it can be seen that the objective function monotonically decreases. Since the set of feasible solutions is a finite set, zt is a local optimum solution at a sufficiently large t.
A method for determining, for zt,
[Math. 32]
will now be described. For any
w∈≥0 [Math. 33]
w(z):=αw·(w−z)−log (z!) [Math. 34]
is given, where αw is any real number satisfying −log (w+1)≤αw≤−log w. At this time,
w(w)=g(w)
w(z)≥g(z) ∀z∈ [Math. 35]
hold. Thus, when, for zt,
is set,
(zt)=(zt)
(z)≥(z), ∀z∈, [Math. 37]
and conditions 1 and 2 above are satisfied. Further, the problem of minimizing
(z):=(z)+(z) [Math. 38]
in a set of feasible solutions,
, [Math. 39]
is a minimum convex cost flow problem; thus, a minimum solution can be efficiently obtained, and condition 3 above is satisfied. Thus,
:=t [Math. 40]
may be set.
In view of the above, in order to solve a minimum cost flow problem constructed by (a) to (g) above, a technique based on discrete DC programming is used and the cost function of an edge stretched from vertex uj(i) to vertex vj(i) in graph G is replaced with
h
ij(z)+
(in other words, the cost function is corrected (or approximated)), and then the minimum convex cost flow problem shown in Math. 30 above is solved to obtain zt+1. Then, when zt+1 converges (that is, when the value of the objective function does not change), zt+1 at this time is outputted as an optimum solution of the minimum cost flow problem constructed by (a) to (g) above (that is, an optimum solution (MAP estimation solution) of the optimization problem shown in Math. 9 above).
<Hardware Configuration>
Next, a hardware configuration of the MAP estimation apparatus 10 according to the present embodiment is described with reference to
As shown in
The input device 11 is, for example, a keyboard, a mouse, a touch panel, or the like. The display device 12 is, for example, a display or the like. The MAP estimation apparatus 10 may not include both of the input device 11 and the display device 12, for example.
The external I/F 13 is an interface with an external device such as a recording medium 13a. The MAP estimation apparatus 10 can perform reading, writing, etc. of the recording medium 13a via the external I/F 13. Examples of the recording medium 13a include a CD (compact disc), a DVD (digital versatile disk), an SD memory card (Secure Digital memory card), a USB (Universal Serial Bus) memory card, and the like.
The communication I/F 14 is an interface for connecting the MAP estimation apparatus 10 to a communication network. The processor 15 is, for example, an arithmetic device of various types such as a CPU (central processing unit) and a GPU (graphics processing unit). The memory device 16 is, for example, a storage device of various types such as an HDD (hard disk drive), an SSD (solid state drive), a RAM (random access memory), a ROM (read-only memory), and a flash memory.
By having the hardware configuration shown in
<Functional Configuration>
Next, a functional configuration of the MAP estimation apparatus 10 according to the present embodiment is described with reference to
As shown in
Further, the MAP estimation apparatus 10 according to the present embodiment includes a potential storage unit 201, an aggregate data storage unit 202, and a MAP estimation solution storage unit 203. These units are obtained by using, for example, the memory device 16. At least one storage unit among these units may be obtained by using, for example, a storage device (a database server or the like) connected to the MAP estimation apparatus 10 via a communication network.
The input unit 101 stores (local) potentials and aggregate data provided to the MAP estimation apparatus 10 in the potential storage unit 201 and the aggregate data storage unit 202, respectively. The input unit 101 may perform correction, etc. of potentials stored in the potential storage unit 201 or aggregate data stored in the aggregate data storage unit 202 in accordance with, for example, an operation or the like from the input device 11.
The instance construction unit 102 constructs (an instance of) a minimum cost flow problem on the basis of (a) to (g) above by using potentials stored in the potential storage unit 201 and aggregate data stored in the aggregate data storage unit 202.
The MAP estimation unit 103 calculates a MAP estimation solution from a minimum cost flow problem constructed by the instance construction unit 102, and stores the MAP estimation solution in the MAP estimation solution storage unit 203. Here, the MAP estimation unit 103 includes a correction unit 111 and a capacity scaling unit 112. The correction unit 111 corrects (or replaces or approximates) the cost function of an edge stretched from vertex uj(i) to vertex vj(i) in graph G. The capacity scaling unit 112 solves the minimum convex cost flow problem shown in Math. 30 above to calculate a solution n*=zt+1.
The output unit 104 outputs a MAP estimation solution n* stored in the MAP estimation solution storage unit 203 to an arbitrary output destination.
The potential storage unit 201 stores potentials φijk given to the MAP estimation apparatus 10. An example of potentials φijk stored in the potential storage unit 201 is shown in
The aggregate data storage unit 202 stores aggregate data yij provided to the MAP estimation apparatus 10. An example of aggregate data yij stored in the aggregate data storage unit 202 is shown in
The MAP estimation solution storage unit 203 stores a MAP estimation solution n* calculated by the MAP estimation unit 103. An example of the MAP estimation solution n* stored in the MAP estimation solution storage unit 203 is shown in
<MAP Estimation Processing>
Next, a flow of MAP estimation processing according to the present embodiment is described with reference to
First, the instance construction unit 102 reads potentials φijk stored in the potential storage unit 201 and aggregate data yij stored in the aggregate data storage unit 202 (step S101).
Next, the instance construction unit 102 constructs an instance of a minimum cost flow problem on the graph G=(V, E) on the basis of (a) to (g) above by using the potentials φijk and the aggregate data yij read in step S101 above (step S102).
Next, the MAP estimation unit 103 solves the minimum cost flow problem constructed in step S102 above by a technique based on discrete DC programming, and calculates a MAP estimation solution (step S103). The MAP estimation solution calculated by the MAP estimation unit 103 is stored in the MAP estimation solution storage unit 203. Details of the processing of step S103 will be described later.
Then, the output unit 104 outputs the MAP estimation solution stored in the MAP estimation solution storage unit 203 to an arbitrary output destination (step S104). Examples of the output destination of the MAP estimation solution include another device or program connected via a communication network, the display device 12 such as a display, etc.
Here, the processing of step S102 above (the processing of calculating a MAP estimation solution by a technique based on discrete DC programming) is described with reference to
First, the MAP estimation unit 103 performs initialization of z0←0 and t←0 (step S201).
Next, the MAP estimation unit 103 executes the processing of repeating step S301 to step S304 below (step S202). This repetition processing is repeatedly executed until YES is obtained as decision in step S303.
The correction unit 111 of the MAP estimation unit 103 replaces the cost function of an edge stretched from vertex uj(i) to vertex vj(i) in graph G with
h
ij(z)+
(step S301).
Next, the capacity scaling unit 112 of the MAP estimation unit 103 solves the minimum convex cost flow problem shown in Math. 30 above by capacity scaling, and calculates a solution zt+1 (step S302). Note that the capacity scaling is an algorithm for efficiently solving the minimum convex cost flow problem (see, for example, Reference 3).
Reference 3: R. K. Ahuja, T. L. Magnanti, J. B. Orlin, Network Flows: Theory, Algorithms, Applications, Prentice Hall, 1993.
Next, the MAP estimation unit 103 decides whether
(zt)=(zt+1) [Math. 43]
is satisfied or not (step S303). That is, the MAP estimation unit 103 decides whether the solution zt+1 has converged or not.
In the case where it is not decided in step S303 above that the solution zt+1 has converged (NO in step S303), the MAP estimation unit 103 sets t←t+1 (step S304), and then returns to step S301. On the other hand, in the case where it is decided in step S303 above that the solution zt+1 has converged (YES in step S303), the MAP estimation unit 103 outputs (stores) the solution zt+1 as a map estimation solution n* to the MAP estimation solution storage unit 203 (step S203).
The present invention is not limited to the above specifically disclosed embodiment, and various modifications and changes, combinations with known technologies, etc. can be made without departing from the scope of the claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/041426 | 11/5/2020 | WO |