The present invention relates to an estimation apparatus, an estimation method, and a program, and particularly to an estimation apparatus, an estimation method, and a program for estimating parameters of a mixture model from censored data.
The “censored data” refers to data, for which an observed value of a sample is at or above (or below) a certain threshold, a value of the sample is not observed, and only information that the value is above the threshold is obtained. Many pieces of data are expressed as the censored data, such as clinical data descriptive of disease onset, human death, and the like, contract history data of the Internet line user, and service usage history data of an e-commerce site. A representative example of a problem in using the censored data is a survival time analysis that estimates a distribution of time required until a device fails, for example. Since a survival time distribution is often multimodal due to the presence of initial failure, deterioration failure, and the like, distribution estimation using a mixture model is widely used.
In order to estimate the parameters of the mixture model from the censored data, an Expectation Maximization for Censored Mixturemodels (EMCM) algorithm proposed in the document (NPL 1) can apply.
However, this approach has the following two problems.
A first problem is the presence of a local optimum solution. Specifically, the first problem is that in the EMCM, while monotonic decrease of an objective function is guaranteed, a convergence destination changes depending on an initial value, and thus, repeated execution from different initial values is required.
A second problem is the need to calculate statistics of truncated distributions (distribution changed to take values only in a certain range) of a probability distribution utilized in a model. Specifically, the second problem is that, except for exceptions such as a one-dimensional truncated normal distribution that is a truncated distribution of a one-dimensional normal distribution, the statistics cannot be analytically calculated, and thus, it is necessary to utilize numerical calculations such as the Monte Carlo method. Since the EMCM is an algorithm that iterates parameter update and calculation of the statistics many times, it may be desirable to avoid repeating numerical calculations in each iteration.
The present invention has been made in view of the above, and has an object to provide an estimation apparatus, an estimation method, and a program capable of suppressing a computation time and accurately estimating parameters of a model representing a probability distribution of censored data.
An estimation apparatus according to the present invention is an estimation apparatus for estimating parameters of a model representing a probability distribution of censored data, the censored data including observation data of a sample with an observed value being observed, observation data of a sample with no observed value being observed, and a variable representing whether or not an observed value is observed for each sample, the estimation apparatus including: an input unit configured to receive input of the censored data; and a parameter estimation unit configured to estimate the parameters of the model by optimizing an objective function that is a divergence between a model representing a probability distribution of the censored data expressed using a probability density function of observation data, the probability density function being represented by a mixture model of each component representing a distribution of observed values corresponding to each sample of the censored data received by the input unit, and a probability distribution of the censored data obtained from the censored data received by the input unit.
An estimation method according to the present invention is an estimation method for estimating parameters of a model representing a probability distribution of censored data, the censored data including observation data of a sample with an observed value being observed, observation data of a sample with no observed value being observed, and a variable representing whether or not an observed value is observed for each sample, the estimation method including: receiving, at an input unit, input of the censored data; and estimating, at a parameter estimation unit, the parameters of the model by optimizing an objective function that is a divergence between a model representing a probability distribution of the censored data expressed using a probability density function of observation data, the probability density function being represented by a mixture model of each component representing a distribution of observed values corresponding to each sample of the censored data received by the input unit, and a probability distribution of the censored data obtained from the censored data received by the input unit.
A program according to the present invention is a program causing a computer to execute processing of estimating parameters of a model representing a probability distribution of censored data, the censored data including observation data of a sample with an observed value being observed, observation data of a sample with no observed value being observed, and a variable representing whether or not an observed value is observed for each sample, the program causing the computer to execute processing including: receiving, by an input unit, input of the censored data; and estimating, by a parameter estimation unit, the parameters of the model by optimizing an objective function that is a divergence between a model representing a probability distribution of the censored data expressed using a probability density function of observation data, the probability density function being represented by a mixture model of each component representing a distribution of observed values corresponding to each sample of the censored data received by the input unit, and a probability distribution of the censored data obtained from the censored data received by the input unit.
According to the estimation apparatus, the estimation method, and the program according to the present invention, the input unit receives input of censored data, the censored data including observation data of a sample with an observed value being observed, observation data of a sample with no observed value being observed, and a variable representing whether or not an observed value is observed for each sample.
The parameter estimation unit estimates parameters of a model by optimizing an objective function that is a divergence between a model representing a probability distribution of censored data expressed using a probability density function of observation data, the probability density function being represented by a mixture model of each component representing a distribution of observed values corresponding to each sample of the censored data which is received by the input unit, and a probability distribution of the censored data obtained from the censored data which is received by the input unit.
In this way, it is possible to suppress a computation time and accurately estimate parameters of a model representing a probability distribution of censored data by estimating the parameters of the model by optimizing an objective function that is a divergence between a model representing a probability distribution of censored data expressed using a probability density function of observation data, the probability density function being represented by a mixture model of each component representing a distribution of observed values corresponding to each sample of the censored data, and a probability distribution of the censored data obtained from the censored data, the censored data including observation data of a sample with an observed value being observed, observation data of a sample with no observed value being observed, and a variable representing whether or not an observed value is observed for each sample.
Moreover, in the estimation apparatus according to the present invention, the model representing the probability distribution of the censored data can be expressed using a probability distribution of the variable represented using the probability density function of the observation data and a length of a time to an observation end given in advance for each sample, and a probability distribution of the observation data with the variable being given, the variable being represented using the probability density function of the observation data and the length of the time to the observation end given in advance for each sample.
In the estimation apparatus according to the present invention, the objective function can be a Kullback-Leibler divergence or an L2 divergence between the model representing the probability distribution of the censored data and the probability distribution of the censored data.
According to an estimation apparatus, an estimation method, and a program of the present invention, it is possible to suppress a computation time and accurately estimate parameters of a model representing a probability distribution of censored data.
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
Principle of Estimation Apparatus According to Embodiment of the Present Invention First, a principle of an embodiment of the present invention will be described.
The embodiment of the present invention establishes two schemes using different objective functions illustrated below. A first one is an estimation scheme based on Kullback-Leibler (KL) divergence minimization between probability distributions, and a second one is an estimation scheme based on L2 divergence minimization between probability distributions. The first estimation scheme can estimate the parameters in a very simple repeated calculation, and the second estimation scheme can analytically estimate the parameters even without requiring repeated iterations.
The point for establishing the schemes according to the embodiment of the present invention is the use of an approach called an examplar based model (hereinafter, referred to as an emb) (Reference document 1).
In the emb approach, the parameters of each component of a mixture model are not explicitly handled as parameters, and each component is placed at a point where a data point is present. This can establish an algorithm that converges to a global optimum solution by formulation estimating only a mixing ratio (weighting parameters for each component). The scheme according to the embodiment of the present invention may be considered as one obtained by developing this approach such that censored data can be used as input.
Furthermore, although statistics of truncated distributions required for the EMCM described above are necessary for the estimation of the parameters of the component, adoption of this approach allows a scheme not requiring the repeated numerical calculations to be established.
Note that the method disclosed in the document (Reference document 1) proposes a scheme for performing the emb by the KL divergence minimization using normal data, but cannot handle the censored data.
A document (Reference document 2) also considers the approach using the L2 divergence for a function of error in the probability distribution, but assumes a situation in which the normal data is input rather than the censored data.
In addition, a document (Reference document 3) is a study using censored data and L2 divergence, but assumes only a case that a simple unimodal distribution such as an exponential distribution and a Weibull distribution is used as a model.
Preparation
First, the censored data will be described.
In the device failure data, a time of installation of each device and a time of failure when a failure occurred are recorded. As for devices 1 and 2, since failures occurred during an observation period, the times of the failures are recorded. On the other hand, as for devices 3 and 4, since failures did not occur during the observation period and the observation was ended, time of failures are not recorded.
However, it can be read that because any of the devices 3 and 4 may inevitably fail, the times of failures thereof are at and after a time of observation end. As described above, data including combination of data with observed values (time of failure) known like as for the devices 1 and 2 and data with observed values (time of failure) known to be equal to or more a certain value like as for the devices 3 and 4 is referred to as censored data.
The device failure data in
A user 1 simultaneously starts to use both services, and simultaneously cancels during the observation period, so the cancellation times of both services are recorded. A user 2 simultaneously starts to use both services, and cancels only a service 2 during the observation period. A user 3 simultaneously starts to use both services, and cancels only a service 1 during the observation period. A user 4 start to use the service 2 first, and additionally contracts the service 1 during the observation period. Thus, because of observation censoring, a cancellation time of the service 1 of the user 2, a cancellation time of the service 2 of the user 3, and a cancellation time of the services 1 and 2 of the user 4 are not recorded.
In this way, the two-dimensional censored data includes three types of censored data in which dimensions of censored values are different. In general, n-dimensional censored data includes 2n−1 types of censorings. Note that, hereinafter, an example in which whether the observation is censored in each dimension is determined will be described, but the same approach can be used in a situation where all elements are not observed when any one of the elements is censored.
Here, a definition of the censored data is given. For simplicity of handling, the observation data is expressed using a survival time (time from installation of a device to failure, time from service contract to cancellation), rather than a calendar time as in
Censored data is written as:
={xi,wi}i=1n
Here both
xi
and
wi
are d-dimensional vectors, where
x
i∈d
and
w
i∈{0,1}d
xij represents a usage time of a service j of a user i, and wij represents whether a cancellation time of the service j of the user i is recorded (wij=1) or is not recorded due to censoring (wij=0). Similarly, a length of a time to an observation end time of a i-th user is written as:
v
i∈d
vij represents a length from a usage start time to an observation end time of the service j of the user i. In a case that no observed value is observed due to censoring (wij=0), it is assumed that xij=vij is set.
Mixture Model
Next, a model used in the embodiment of the present invention will be described. A probability density function of an observed value, represented by a mixture model, is generally defined by the following an equation (1):
[Math. 1]
f(x|θ)=Σk=1Kθkψk(x)=θTψ(x). (1)
Where K represents the number of mixtures, and
ψk(x)
represents a probability distribution of the k-th component. For the probability distribution of a component
ψk(x)
a Gaussian distribution expressed by, for example, the following equation can be used.
Where
μk
and a represent a mean and a standard deviation of the Gaussian distribution, respectively. However, in accordance with the approach of the ebm (Reference document 1), parameters of the probability distribution of the component
(In a case of Gaussian distribution, μ1, . . . , μK)
is set as
(μ1, . . . ,μn)=(x1, . . . ,xn)
with K=n, and such that each corresponds to an observation data point. In this case, the probability density function of the observed value is represented by a mixture model of each component for each piece of the observation data, and the mean of the Gaussian distributions included in each component is used as the corresponding observed value. Since the scheme handles the censored data, K=n is set, and in a case that the value is observed (wij=1), μij=xij may be set, otherwise (wij=0), μij=xij+ε may be set. ε represents a randomly generated value from a probability distribution (e.g., an exponential distribution) that takes a value of 0 or more. In a case that the number of pieces of data is large, for example, only 100 pieces of data that are randomly selected may be used, or a component set based prior knowledge may be used. The standard deviation a can be determined by cross validation or the like.
A generative process of the censored data
={xi,wi}
when using the model described above can be described as follows. First, with the length of the time to the observation end for each sample
vi
being known, a variable representing whether censoring occur
wi
is generated in accordance with a probability distribution of an equation (2) below:
Where among
xi
and
vi,
a set of elements in the case of wij=1 with observed values being observed is set as below:
x
i
o
={x
ij
|w
ij=1},vio={vij|wij=1}
Similarly, a set in the case of wij=0 with no observed value being observed is set as below:
x
i
c
={x
ij
|w
ij=0},vic={vij|wij=0}
In a case that the elements with the observed values being observed are not distinguished from the elements with no observed value being observed, those are referred to as observation data. In a case that a standard probability distribution, such as the Gaussian distribution of the equation (2) above is used for the probability distribution of the component
ψk(x),
ψkw(wi;vi)
be calculated analytically using a cumulative density function.
In a case that
w
i≠0
and at least one observed element is present,
xi
is generated in accordance with a distribution expressed by an equation (4) below.
[Math. 4]
P(xi|wi,θ)=δ(xic−vic)ftr(xio|wi,θ) (4)
Where
δ(⋅)
is a delta function, and ftr represents a truncated distribution of a distribution obtained by marginalizing f with respect to the not observed element and expressed by an equation below.
Where, ftr is according to equations (5) and (6) below.
In a case that the component is the Gaussian distribution of the equation (2) described above,
ψikx
is expressed as below:
ψikx
μkx
are vectors obtained by extracting elements of dimensions corresponding to
xio
and
xic
from
μk, respectively.
In a case that
wi=0
and no observed element is present, an expression with only a delta function is given as expressed by an equation (7) below.
[Math. 7]
P(xi|wi=0,θ)=δ(xic−vi) (7)
Accordingly, in summary, a generative probability of each piece of censored data is given by an equation (8) below:
[Math. 8]
P(xi,wi|θ)=P(xi|wi,θ)P(wi|θ) (8)
KL Divergence and L2 Divergence
Next, the divergences utilized in defining an objective function of proposed schemes are described. As is well known, the Kullback-Leibler (KL) divergence for probability distributions p(x) and q(x) is defined by an equation (9) below:
In addition, in the embodiment of the present invention, a case in which the L2 divergence (Reference document 2) defined below is also used will be described (equation (10) below).
The L2 divergence is defined as a squared error of two probability density functions. Which divergence should be used depends on the problem. Thus, in the embodiment of the present invention, two types of schemes have been established so that any divergence can be used.
Estimation by Optimization of KL Divergence
First, a proposed scheme in a case that the KL divergence is used as an objective function is described. The KL divergence between a model representing a probability distribution of censored data
P(x, w|θ)
and a true probability distribution obtained from the censored data
is given by an equation (11) below, in accordance with the definition of the equation (9) above.
This can be deformed as in an equation (12) below.
Constant terms are removed and expected values for the true distribution unknown
are replaced with sample means to derive an equation (13) below.
Where
KL(θ)
is an amount capable of being calculated from the data, and is used as an objective function to derive an algorithm. Specifically, an optimization problem of an equation (14) below may be solved.
Here, constraint conditions (where the elements of the parameters are equal to or more than 0 and the sum is 1) is for the probability distribution of the mixture model f. Using the Lagrange method of undetermined multipliers, it can be seen that the solution for the optimization problem described above satisfies an equation (15) below.
Thus, based on the equation (15) above, optimization is possible by repeating the updates of
{circumflex over (θ)}.
Note that, as in the scheme of Reference document 1, in order to reduce the amount of calculation and accelerate the convergence, in a case that θk is smaller than a certain threshold (e.g., 10−3/n) during the parameter update, θk=0 may be set, and then, a re-normalization operation may be performed to adjust the overall to satisfy the sum to one.
Estimation by Optimization of L2 Divergence
Next, a proposed scheme in a case that the L2 divergence is used is described. Unlike the KL divergence, an objective function is not directly defined from the definition of the L2 divergence, and a new objective function is defined by focusing the objective function when using the KL divergence.
Focusing on the objective function (equation (12)) when using the KL divergence, it can be seen that two terms of a term corresponding to the KL divergence of a marginal distribution of the mixture model f
fx
the true distribution of the observed values with variables being given
and a term corresponding to a log-likelihood ratio using the distribution of the model of variable representing that the observed value is not observed
the true distribution
are weighted by
and summed.
Based on this insight, the objective function can be designed as the follow equation by replacing the parts using the KL divergence/log-likelihood ratio in each of these two terms with the L2 divergence.
This is deformed to obtain an equation below:
Further, consider optimization of an equation below obtained by removing the constant terms (Const) and replacing the means for the true distribution with the sample means.
Where
n
w=Σi=1nI(wi=w)
is set. This is used as the objective function to derive the algorithm.
L2(θ)
can be expressed in a matrix-vector form as in equations (16) and (17) below.
Where the above is according to equations (18) to (20) below.
Also, in a case that a Gaussian distribution is used for the distribution of the components,
Ĝ
is an analytically computable value and can be expressed as an equation (21) below.
It is noted that, in the equation (21) above, regarding summing up for
w,
dx
or
μkx
is a value different depending on a value of
w.
It is found from this that the objective function is expressed in a secondary form for
θ.
Accordingly, an estimate value of the parameter can be obtained by solving a constrained secondary optimization problem described below.
A numerical solver can be used to directly solve the optimization problem of an equation (22) above. At this time, an optimization problem described below to which a normalization term is added may be solved.
Where, β represents a hyper parameter. In addition, an approximate method described below may be used, for example. An optimum solution for a problem in which the constraint is removed from the optimization problem of the equation (22) above is found as expressed in equations (23) and (24) below.
Where,
is a normalization term and β is a hyper parameter, which has an effect of preventing parameter dissipation.
θ
has a value equal to or more than 0 and the sum is 1, and thus,
{circumflex over (θ)}
satisfying a condition is obtained by processing of an equation (25) below.
Where,
|⋅|
represents
a 1 norm.
In the embodiment of the present invention, parameters of a model are estimated using any of two schemes described above to make it possible to suppress a computation time and accurately estimate the parameters of the model representing a probability distribution of censored data.
Configuration of Estimation Apparatus According to Embodiment of the Present Invention
A configuration of an estimation apparatus 1 according to the embodiment of the present invention will be described with reference to
As illustrated in
As illustrated in
The data processing unit 10 stores, in a data storage unit 41, censored data
={xi, wi} including
xio
that is observation data of a sample with an observed value being observed, the observed value being received by the input unit 50,
xic
that is observation data of a sample with no observed value being observed, and
wi
that is a variable representing whether or not an observed value is observed for each sample. The sample refers to, for example, each device in the examples of
The parameter estimation unit 20 estimates parameters of a model
θ
by optimizing an objective function that is a divergence between
P(x, w|θ)
that is a model representing a probability distribution of censored data
expressed using a probability density function of observation data, the probability density function being represented by a mixture model of each component representing a distribution of observed values corresponding to each sample of the censored data
which is received by the input unit 50
and
that is a true probability distribution of the censored data
obtained from the censored data
received by the input unit 50.
Specifically, the parameter estimation unit 20 first determines the true probability distribution
of the censored data
Next, the parameter estimation unit 20 estimates the parameters assuming that the objective function is a KL divergence or an L2 divergence between
P(x, w|θ)
that is a model representing a probability distribution of the censored data
and
that is a true probability distribution of the censored data
Here,
P(x, w|θ)
that is the model representing the probability distribution of the censored data is expressed, as expressed by the equation (8) above, using
P(xi|wi, θ)
that is a probability distribution of observation data represented using a probability density function of the observation data and a length of a time to an observation end given in advance for each sample, and
that is a probability distribution of variables with the variables being given, the variables being represented using the probability density function of the observation data and the length of the time to the observation end given in advance for each sample.
The parameter estimation unit 20, in a case of using the KL divergence, estimates the parameters by repeating the parameter update of the equation (15) above. The parameter estimation unit 20, in a case of using the L2 divergence, estimates the parameters using the equations (24) and (25) above.
Then, the parameter estimation unit 20 stores the estimated parameters
θ
in a parameter storage unit 42.
The parameter output unit 30 acquires the parameters
θ
in the parameter storage unit 42 to pass the acquired parameters
θ
to the parameter output unit 30.
The storage unit 40 includes the data storage unit 41 and the parameter storage unit 42. The censored data
is stored in the data storage unit 41. The parameters of the model
θ
are stored in the parameter storage unit 42.
The input unit 50 receives the censored data
input from the external apparatus 2. Then, the input unit 50 passes the received censored data
to the data processing unit 10.
The output unit 60 outputs the parameters of the model
θ
received from the parameter output unit 30 to the external apparatus 2.
Action of Estimation Apparatus According to Embodiment of the Present Invention
Once the censored data
={xi, wi}
is input to the input unit 50, an estimation processing routine illustrated in
First, in step S100, the input unit 50 receives input of censored data
={xi, wi} including
that is observation data of a sample with an observed value being observed,
xic
that is observation data of a sample with no observed value being observed, and
wi
that is a variable representing whether or not an observed value is observed for each sample. The input unit 50 accepts an input of a length of a time to an observation end time for each sample i.
In step S110, the parameter estimation unit 20 estimates the parameters assuming that the objective function is a KL divergence or an L2 divergence between
P(x, w|θ)
that is a model representing a probability distribution of the censored data
and
that is a true probability distribution of the censored data
by repeating the parameter update of the equation (15) above, or using the equations (24) and (25) above.
In step S120, the output unit 60 outputs the parameters
θ
estimated in step S110 above.
As described above, according to the estimation apparatus in the embodiment of the present invention, the objective function described below is optimized to make it possible to suppress a computation time for estimating parameters of a model and accurately estimate the parameters of the model representing a probability distribution of censored data. Here, the objective function is given by a divergence between a model representing a probability distribution of censored data described below and a probability distribution of the censored data obtained from the censored data. Here, the model representing the probability distribution of the censored data is expressed using a probability density function of the observation data described below. The probability density function is represented by a mixture model of each component representing a distribution of observed values corresponding to each sample of the censored data, the censored data including observation data of a sample with an observed value being observed, observation data of a sample with no observed value being observed, and a variable representing whether or not an observed value is observed for each sample.
Note that the present invention is not limited to the above-described embodiment, and various modifications and applications may be made without departing from the gist of the present invention.
For example, in the embodiments described above, the case that the KL divergence or the L2 divergence is used as a divergence is described, but the present invention is not limited thereto, and another divergence can be used.
Further, in the embodiment described above, the description is given assuming the censored data that is time-series data, but the present invention is not limited thereto, and the present invention is also applicable to any censored data that is not time-series data.
Furthermore, the estimation apparatus 1 according to the embodiment described above is described as that configured such that the processing of each part is established as a program, installed on a computer used as an estimation apparatus, and executed, but may be configured to be distributed via a network.
In addition, although an embodiment in which the programs are installed in advance has been described in the present specification of the present application, such programs can be provided by being stored in a computer-readable recording medium.
Number | Date | Country | Kind |
---|---|---|---|
2019-029769 | Feb 2019 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/004909 | 2/7/2020 | WO | 00 |