The exemplary embodiment relates to the modeling of independent complex systems over time. It finds particular application in the detection of soft failures in device infrastructures and will be described with particular reference thereto. However, it is to be appreciated that the exemplary embodiment is also applicable to the modeling of other systems based on discrete observations.
There are many applications where it is desirable to decompose one or more complex processes, called observations, into a small set of independent processes, called sources, where the sources are hidden. The following tasks are of particular interest:
inference: given a sequence of observations and knowing the link between sources and observations, find the values of the sources that best fit what has been observed.
learning: given a large number of observations, find the link between sources and observations that best fits the observations.
One example of this problem is in an infrastructure of shared devices, such as a network of printers, where an administrator has the task of maintaining the shared devices in an operational state. Although some device malfunctions result in the sending of a message by the device itself, or result in a catastrophic failure which is promptly reported by a user, other malfunctions, known as “soft failures,” are not promptly reported to the administrator. This is because the device does not become unavailable, but rather suffers a malfunction, degradation, improper configuration, or other non-fatal problem. When a particular device undergoes a soft failure, the pattern of usage of the device often changes. Users who would typically use the device when it is functioning normally tend to make more use of other devices in the network which provide a more satisfactory output. Since soft failures result in productivity losses and add other costs to the operators of the network, it is desirable to detect their occurrence promptly.
Statistical models, such as Hidden Markov Models (HMM) and Independent Component Analysis (ICA), have been used to model processes in which some variables are hidden, but are assumed to be statistically related to observed variables. The HMM makes certain assumptions, including that the values of the hidden variables (states) depend only upon previous values of the hidden variables, that the value of each hidden variable is independent of the values of the other hidden variables, and that the values of the observed variables depend only on the current values of the hidden variables. Under these assumptions, a time sequence of values of the hidden variables is inferred from the temporal variation of the observed variable values and knowledge of the parameters of the stochastic process relating the observed variables to the hidden ones. A known extension of the HMM approach is the factorial hidden Markov model (FHMM) described, for example, in Z. Ghahramani and M. I. Jordan, “Factorial Hidden Markov Models,” in David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pp. 472-478 (MIT Press, 1996), hereinafter, “Ghahramani, et al.”
In FHMM, exact inference has a complexity which is exponential in the number of hidden dynamics, and approximate inference techniques are generally required. Jordan and Ghahramani proposed a variational inference framework to estimate the parameters (see Ghahramani, et al., above, and Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence Saul, “An Introduction to Variational Methods for Graphical Models,” in Michael I. Jordan, ed., Learning in Graphical Models (Kluwer Academic Publishers, Boston, 1998), hereinafter, “Jordan, et al.”).
Existing FHMM implementations generally operate on observed variables that are continuous. For example, the variational inference framework of Ghahramani, et al. is limited to continuous (Gaussian) observation variables. The hidden states, on the other hand, are assumed to be discrete, and the number of possible states for a given hidden dynamic is an input parameter to the FHMM analysis.
In many practical applications, however, the observed variables are also discrete, such as in the soft failure problem described above. It would be desirable to determine the printer states of printers of a digital network (the hidden parameters) based on observed choices of printer destination made by users (the observed parameters). Each choice of printer destination is a discrete observation limited to N discrete values where N is the number of available printers on the network. It would be advantageous if the set of usage observations could be employed to glean an understanding of the hidden underlying states of the devices in the network using a probabilistic model.
This problem is not limited to networked devices, such as printers, however, but is very general, since it applies to the analysis of any kind of processes producing sequential and high-dimensional discrete data, which must be decomposed into a smaller set of factors to be tractable.
Attempts to analyze FHMM with discrete observations have been less than fully satisfactory. While Jordan, et al. suggests that discrete observations may be accommodated using sigmoid functions, no algorithm is provided. The suggested approach of employing sigmoid functions, even if actually implementable, would have the disadvantage that the sigmoid functions are not well-representative of occurrences of (presumed) independent discrete events such as print job submissions. Other proposed approaches for accommodating discrete observations have been limited to discrete binary observations, and are not readily extendible to observations that can take three or more different discrete values.
There remains a need for an efficient model for solving the inference and learning tasks of FHMMs when the observations are discrete-valued, such as event counts.
The following references, the disclosures of which are incorporated herein in their entireties by reference, are mentioned.
U.S. Pub. No. 20060206445, published Sep. 14, 2006, entitled PROBABILISTIC MODELING OF SHARED DEVICE USAGE, by Jean-Marc Andreoli, et al., discloses methods for estimating parameters of a probability model that models user behavior of shared devices offering different classes of service for carrying out jobs. Parameters of the probability model are learned using the recorded job usage data, the determined range of service classes, and the selected initial number of job clusters.
U.S. Pub. No. 20060132826, published Jun. 22, 2006, entitled AUTOMATED JOB REDIRECTION AND ORGANIZATIONAL MANAGEMENT, by Ciriza et al., relates to a method for managing a plurality of communicatively coupled systems, such as printers, which includes collecting job log data and providing an automated print job redirection away from a malfunctioning printing device.
In accordance with one aspect of the exemplary embodiment, a method for analyzing a hidden dynamic includes acquiring discrete observations, each discrete observation having an observed value selected from two or more allowed discrete values. A factorial hidden Markov model (FHMM) relating the discrete observations to a plurality of hidden dynamics is constructed. A contribution of the state of each hidden dynamic to the discrete observation is represented in the FHMM as an observational parameter which scales at least one nominal parameter which is derived from a nominal distribution of the observations. States of the hidden dynamics are inferred from the discrete observations based on the FHMM. Information corresponding to at least one inferred state of at least one of the hidden dynamics is output.
In accordance with another aspect, a method for predicting states of sources in a system which includes a plurality of sources is provided. The method includes generating a model of the system in which each of a plurality of the sources may assume any one of a plurality of discrete hidden states at a given time, deriving parameters for training a variational expectation maximization algorithm based on discrete observations at a plurality of times, and applying the variational expectation maximization algorithm to a new set of discrete observations to infer the current state of at least one of the sources.
In accordance with another aspect of the exemplary embodiment, a system includes a plurality of shared devices, each of the devices capable of assuming a plurality of hidden states. A statistical analysis system acquires discrete observations for the plurality of devices and predicts hidden states of the plurality of devices based on the discrete observations. The statistical analysis system employs a factorial hidden Markov model (FHMM) which relates usage data with a plurality of states of the devices, the statistical analysis system inferring states of the devices from the discrete observations based on the FHMM and outputting information corresponding to at least one inferred state of at least one of the devices.
The exemplary embodiment relates to an apparatus and method in which statistical analysis of discrete observations provides useful information about underlying states of a system of interest. In various aspects, a Factorial Hidden Markov Model (FHMM) is used to model the system over time. The method employs inference and learning algorithms which allow the method to be scaled up to the analysis of several hundred independent dynamics and high-dimensional observations.
The exemplary embodiment provides algorithms for FHMM which allow modeling of systems with discrete observables. Using a sparse representation of the data, these algorithms are well adapted to high dimensional discrete data that arise in many applications. More generally, when dealing with multidimensional time series, many observations may be correlated or even redundant. One way to suppress these correlations is to identify independent factors, as in Principal Component Analysis. Such a reduced set of factors is easier to grasp by application experts and can be traced back to the process at hand and help better explain it. The exemplary method is also useful in the design of easy to grasp human-oriented visualizations of the data.
As used herein, a job is an interaction between a user and a device for carrying out a selected function (e.g., printing).
In one embodiment, the method is used in the analysis of device logs. These logs record jobs printed and a limited amount of information, such as the time, the identity of the user, and the identity of the device. The events in a group of shared devices can be logged and analyzed, in accordance with the method described herein, to provide a wealth of information about the behavior of the users which may be of benefit to designers and maintenance operators. A suitable model for explaining such logs is to consider that the users have a “nominal” behavior which is independent of time, and which is temporally modified by the combined effects of multiple independent dynamics capturing failure modes in components of the devices.
In one embodiment, the method is used for soft failure detection in device infrastructures. By “soft failure,” it is meant that a device fails to operate properly, but this failure is observed through changes in usage patterns, rather than being reported by the device itself. A goal, in this embodiment, is to detect abnormal use of shared devices, such as printers and copiers, in an infrastructure such as an office. The abnormality can be due to a device failure, in which case several users may then switch from one device to another. Or it can be caused by a sudden change in user behavior, for example, when a user changes to a different office. A natural assumption is the independence of the device failures. The proposed algorithm scales up linearly with the number of hidden dynamics, so it can be applied to infrastructures involving hundreds of devices.
While the examples of the usefulness of the exemplary method focus on devices such as printers, it should be appreciated that the exemplary method is not limited to the particular applications described.
For example,
In yet other embodiments, any one or more of the computers 22, 24 in
Depending on the embodiment in which the statistical analysis system 36 operates, discrete observations in the form of usage data may be recorded (or retrieved from one or more recording facilities) from any one or a combination of the printers and print server(s). In one embodiment, job usage data (typically counts) is retrieved from job log data recorded at the printers. In an alternate embodiment, the recorded job usage data is job log data stored, for example, on a centralized job spooler or print server. In yet another embodiment, job usage data is accumulated individually by printers operating on the network through a distributed exchange of information (e.g., via a defined negotiation protocol). Observations on the user may also be recorded (e.g., the user's network account on the computer 22, 24 from which the print job was sent).
While the exemplary embodiment is described in relation to a network of shared devices, such as printers, it is to be appreciated that the methods described are applicable to the prediction of underlying states of sources other than printers, based on discrete observations other than counts. By discrete observations, it is meant that the observations range over a countable set of values. Thus, unlike temperature, which is a continuously changing variable, counts, for example, are recorded intermittently, e.g., when a user decides to use a particular printer, and may assume integral values, such as 0, 1, 2, 3, 4, etc., in a given time period.
In the exemplary embodiment, an efficient algorithm is used to provide a maximum likelihood estimation of the parameters in a FHMM problem of the type illustrated in
Development of a Statistical Model
The following description describes the development of a general model which may be used in the statistical analysis of discrete observations.
This section describes the aspects related to the operations for developing a model, as shown in
1. Model Definition (S100)
Every observed signal is denoted by y=(y1, . . . , yT) where T is the length of the observation sequence. An individual observation yt at time t is a J-dimensional vector containing discrete values which are usually counts of different types of events which occurred since the last observation. ytj denotes the value of the j-th feature at time t. It is assumed that these observations are generated by K sources. The state of all the sources is represented by xt=(xt1, . . . , xtK) where xtk is the state of the k-th source at time t.
1.1 Latent State Dynamic
First, it is assumed that the K sources are mutually independent, i.e.
$p(x)=\prod_{k=1}^{K} p(x_{1k},\ldots,x_{Tk})$
where p is the joint probability distribution of all the states.
The k-th source is modeled by a Markov chain with initial proportions Pk of size Lk and a transition matrix Tk of size Lk×Lk, where Lk is the number of possible states of the k-th dynamic. The distribution of the state xtk of the k-th dynamic at time t is defined recursively, as usual for Markov chains:
p(x1k=l)=Pkl (1)
p(xtk=l′|x(t−1)k=l)=Tkll′ (2)
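The recursion in (1) and (2) can be illustrated with a minimal sketch that samples K independent chains. The function names, the 0-based state indices, and the numerical values of the initial proportions and transition matrices are assumptions made for illustration only; they do not come from the exemplary embodiment.

```python
import random

def sample_chain(initial, transition, length, rng):
    """Sample one chain: initial[l] = p(x_1 = l), transition[l][m] = p(x_t = m | x_(t-1) = l)."""
    states = list(range(len(initial)))
    path = [rng.choices(states, weights=initial)[0]]
    for _ in range(length - 1):
        # The next state depends only on the previous state of the same chain.
        path.append(rng.choices(states, weights=transition[path[-1]])[0])
    return path

def sample_sources(chains, length, seed=0):
    """Sample K mutually independent chains; chains is a list of (P_k, T_k) pairs."""
    rng = random.Random(seed)
    return [sample_chain(P, T, length, rng) for P, T in chains]

# Hypothetical parameters: a 2-state source and a 3-state source (states are 0-based here).
chains = [
    ([0.9, 0.1], [[0.95, 0.05], [0.20, 0.80]]),
    ([1.0, 0.0, 0.0], [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]]),
]
paths = sample_sources(chains, length=10)
```

Each entry of `paths` is the state sequence of one source; the sources never look at each other's states, which is the independence assumption stated above.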
1.2 Observation Distribution
For the observation probability, several choices are possible, depending on the application. Although the exemplary method is applicable to any distribution in the exponential family, a focus here is placed on multinomial and Poisson distributions, which arise naturally when the observations are event counts. For example, consider models where the natural parameters of the observation distribution at time t depend only on the states of the hidden dynamics {xt1, . . . , xtK} at time t. More precisely, given the states l1, . . . , lK of the K dynamics, it is assumed that the observation distribution is obtained from a known nominal distribution of the exponential family, which depends only on time, by applying multiplicative contributions, dependent on the different states lk, to nominal parameters derived from the nominal distribution. Thus, each dynamic contributes to amplify or, on the contrary, attenuate the effects of the nominal parameters on the observed counts. The contribution of the k-th dynamic being in state l on the j-th observation is captured by a single scalar observational parameter βjkl. Multinomial and Poisson cases may be treated as follows. However, it should be noted that the two cases lead to similar equations and can be cast into the same algorithm, as outlined in further detail below.
Poisson Observations:
In this case, it is assumed that the j-th discrete variable ytj at time t is sampled from a Poisson distribution with nominal parameter (here, the mean of the distribution) λtj, so that the actual parameter is:
It may further be assumed that, conditionally to the states of the hidden dynamics, the J discrete variables ytj are independent. The overall observation distribution is therefore given by:
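The two Poisson expressions introduced above are not reproduced in this text. The following is a reconstruction offered as an assumption: it is chosen to agree with the indicator-notation form of section 4.2 and with the standard Poisson probability mass function. The numbering (3) and (4) is likewise inferred from the later references to Eqns. 5 and 6.

```latex
% Assumed reconstruction: actual Poisson parameter given the states l_1, ..., l_K
\[
\tilde{\lambda}_{tj}(l_1,\ldots,l_K) \;=\; \lambda_{tj}\,\exp\Big(\sum_{k=1}^{K}\beta_{jkl_k}\Big) \tag{3}
\]
% Assumed reconstruction: J conditionally independent Poisson variables
\[
p(y_t \mid x_t,\beta) \;=\; \prod_{j=1}^{J}\frac{1}{y_{tj}!}\,\tilde{\lambda}_{tj}^{\,y_{tj}}\,e^{-\tilde{\lambda}_{tj}} \tag{4}
\]
```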
Multinomial Observations:
In this case, it is assumed that the sum nt=Σj=1Jytj of the discrete variables ytj is a global count which is known at each time, and that each ytj represents a contribution to that count (hence they are not independent). The overall nominal observation distribution is assumed to be a multinomial distribution with parameters ptj, so that the actual parameters are:
(the proportionality coefficient is given by the intrinsic constraint that the probabilities sum to one). Hence the overall observation distribution is:
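The two multinomial expressions introduced above are also missing from this text. A reconstruction consistent with the indicator-notation form of section 4.2 and with the standard multinomial probability mass function is sketched below; it is an assumption, not the original equations. The numbers (5) and (6) follow the way the worked example later refers to Eqn. 5 for the parameters and Eqn. 6 for the distribution.

```latex
% Assumed reconstruction: actual multinomial parameters, normalized to sum to one over j
\[
\tilde{p}_{tj}(l_1,\ldots,l_K) \;\propto\; p_{tj}\,\exp\Big(\sum_{k=1}^{K}\beta_{jkl_k}\Big) \tag{5}
\]
% Assumed reconstruction: multinomial distribution of the J counts given the total n_t
\[
p(y_t \mid x_t,\beta) \;=\; \frac{n_t!}{\prod_{j=1}^{J} y_{tj}!}\,\prod_{j=1}^{J}\tilde{p}_{tj}^{\,y_{tj}} \tag{6}
\]
```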
1.3 The Generative Model
To summarize, the parameters of the model are:
where α represents a transitional parameter which relates to a probability that a source is in a given state at a first time (t), knowing its state at a second time (t−1) prior to the first time.
where β represents an observational parameter: the contribution of the k-th source being in state l on the j-th observation. Note that because the parameters βjkl appear as arguments of an exponential function, their effect is multiplicative. If βjkl is zero, the k-th dynamic, when in state l, has no effect on the distribution of the j-th observation; if it is positive, the effect is to multiplicatively increase the corresponding parameter of that distribution, while if it is negative, the effect is a reduction.
2. Inference (S110)
Inference aims at finding the conditional distribution p(x|y, β, α) of the hidden variables x knowing the model parameters (α, β) and the observations y. The computation of this distribution, although theoretically possible using general inference methods such as the Forward-Backward algorithm (as described, for example, in C. M. Bishop, PATTERN RECOGNITION AND MACHINE LEARNING (Springer, 2006)), becomes exponentially expensive as the number of hidden dynamics grows. For problems involving large values of K, for example, K>10, an exact computation may be computationally too expensive. Approximations can be obtained by aiming at a “projection” of the target distribution on a computationally tractable subspace of the space of all possible FHMM distributions. Formally, this projection is defined as the distribution in the subspace which is closest to the target one in the sense of the Kullback-Leibler divergence KL. The problem is therefore decomposed into two steps: (i) choice of the subspace and (ii) resolution of the following optimization problem.
If the subspace contains the target distribution p(x|y,α,β), then the solution to the above problem is exact (the Kullback-Leibler divergence KL equals 0). The choice of subspace therefore implies a trade-off between tractability of the optimization problem (7) and complexity of the subspace in the overall space of possible distributions. The subspace may be chosen to be the space of independent Markov chains, i.e., q belongs to the subspace if and only if:
This choice may be viewed as a crude approximation of the posterior. Better approximations can be obtained by clustering together some of the Markov chains, and observing that a cluster of Markov chains is equivalent to a single one in higher dimension. Exact computation corresponds to the case where all the chains are clustered into a single one. A tradeoff between the quality of the approximation and the dimensionality (hence complexity) of the problem can be obtained by varying the number of clusters, as observed in, for example, E. Xing, M. Jordan, and S. Russell, A GENERALIZED MEAN FIELD ALGORITHM FOR VARIATIONAL INFERENCE IN EXPONENTIAL FAMILIES, in Proc. of the 19th Annual Conf. on Uncertainty in AI (2003). The algorithms below are given for the fully factorized case (i.e. one chain per cluster), but would work just as well with intermediate levels of clustering (as illustrated in
Using Bayes' law and the factorial decomposition of the distributions in the subspace, the minimization problem (7) can be transformed into the maximization of the following quantity:
which lower bounds the likelihood log p(y|α, β) (see, for example, Jordan, et al.). The solution to this maximization problem is given below in section 4 and results in Algorithm 4.1. In the general case, in the prediction step, both α and β are assumed known. For the learning stage (described below), β is assumed unknown and is computed by a standard expectation maximization loop which uses the predictive algorithm based on the current estimate of β at each occurrence of the loop.
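The displayed expressions for problem (7), the factorized family, and the quantity (9) are missing from this text. The sketch below restates them in the standard variational form, which is an assumption consistent with the surrounding description. The symbol $\mathcal{Q}$ for the tractable subspace and the number (8) for the factorization are introduced here for illustration.

```latex
% (7), assumed form: projection of the posterior onto the tractable subspace Q
\[
q^{*} \;=\; \arg\min_{q\in\mathcal{Q}}\ \mathrm{KL}\big(q(x)\,\|\,p(x\mid y,\alpha,\beta)\big) \tag{7}
\]
% (8), assumed form: fully factorized family, each q_k being a Markov chain over T steps
\[
q(x) \;=\; \prod_{k=1}^{K} q_k(x_{1k},\ldots,x_{Tk}) \tag{8}
\]
% (9), assumed form: the quantity to maximize, a lower bound on the log-likelihood
\[
E_q\big[\log p(y\mid x,\beta)\big] \;-\; \mathrm{KL}\big(q(x)\,\|\,p(x\mid\alpha)\big)
\;\le\; \log p(y\mid\alpha,\beta) \tag{9}
\]
```

The gap in (9) equals the divergence minimized in (7), which is why minimizing (7) over q and maximizing (9) over q select the same distribution.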
Algorithm 4.1
Initialisation
It may be noted that the two cases for the observation distribution (Poisson or multinomial) are amenable to the same treatment. In fact, the exact computation of Eq[log p(y|x, β)] in (9) is only feasible with the Poisson observations model.
In the multinomial observations case, an additional variable parameter φ is introduced, together with a lower bound of the objective dependent on φ, so that the inference problem becomes the maximization of the lower bound according to q and φ. In both cases, the maximum of the objective function (either the exact objective or its lower bound) is found using a componentwise maximization procedure along the components qk of q, each of these maximization subproblems being solved by a Forward-Backward algorithm, as described, for example, in Ghahramani, et al.: for every dynamic k, the function F
Scalability: Let $\tilde{L}=\max_k L_k$ be the maximum size of the state range of the hidden dynamics, and let p be the proportion of non-zero elements in the data. Each step of the predictive algorithm has a complexity of $O(T\tilde{L}^{2}Kp)$ and requires $O(T\tilde{L}Kp)$ memory units. Note here the importance of dealing with sparsity: given the dimensions involved, manipulating full matrices would be infeasible.
3. Learning (S104)
When β is unknown, an estimate of the Maximum A Posteriori (MAP) value can be obtained by maximizing the objective function (the exact objective in the Poisson case, or its lower bound in the multinomial case) with respect to β. This corresponds to the Variational EM algorithm, where the E-step at each occurrence of the EM loop is computed by the algorithm of the previous section using the current estimate of β, and the M-step updates the estimate of β using a conjugate gradient method, since the gradient of the objective can be computed easily.
These methods assume that the number of dynamics and the (finite) state space of each of them are known a priori. If this is not the case, they can always be learnt by experimenting with different values and selecting one by cross-validation on the likelihood.
Having given an overview of the processing with respect to
4. Generation of the Algorithm (S100)
To simplify notation, indicator vectors are used to represent the state variables $x_t=(x_{t1},\ldots,x_{tK})$. Defining $S_L=\{v\in\{0,1\}^{L},\ \sum_{l=1}^{L}v_l=1\}$, every state variable $x_{tk}\in S_{L_k}$.
4.1 Hidden State Dynamic
With the indicator vector notation, the probability distribution of the state of the k-th chain given by (1) and (2) can be rewritten as:
The joint probability of the hidden state is therefore:
In the Table shown in
There are 4 time instants (represented by columns) and three dynamics (sources) A, B, C. Source A can be in any one of four states, represented by rows 1-4, source B can be in any one of two states, represented by rows 5-6, and source C in any one of three states, represented by rows 7-9. The initial state of the first source is x11=(0,1,0,0), which means that it is in the state 2 (out of 4 possibilities). Similarly, the second dynamic is initially in the state 1 (out of 2 possibilities).
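The indicator-vector encoding described above can be sketched as follows. The state sizes (4, 2, 3) and the initial states of sources A and B are taken from the description; the initial state of source C and all function and variable names are assumptions for illustration.

```python
def one_hot(state, size):
    """Return the indicator vector of a 1-based state index among `size` possible states."""
    if not 1 <= state <= size:
        raise ValueError("state must lie between 1 and size")
    return tuple(1 if l == state else 0 for l in range(1, size + 1))

sizes = (4, 2, 3)            # number of states of sources A, B, C
initial_states = (2, 1, 3)   # A and B as described; the state of C is a hypothetical choice

# One indicator vector per source at the first time instant.
x1 = [one_hot(s, L) for s, L in zip(initial_states, sizes)]

# Stacking the vectors gives one column of the table: rows 1-4, 5-6 and 7-9.
stacked = [bit for vec in x1 for bit in vec]
```

Exactly one entry per source is set to 1, so the stacked column always sums to the number of sources.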
4.2 Observations Distribution
The middle array of the table in
Poisson Case:
$\tilde{\lambda}_{tj}(x_t)=\lambda_{tj}\exp(\beta_j^{\top}x_t)$
Multinomial Case:
$\tilde{p}_{tj}(x_t)\propto p_{tj}\exp(\beta_j^{\top}x_t)$
The result of these dot-products are illustrated in the third array of the Table in
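A minimal sketch of these dot-products, assuming the state vector is the stacked indicator column and each βj is a weight vector of the same length. The function names and all numerical values are hypothetical.

```python
import math

def dot(u, v):
    """Dot-product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poisson_rates(nominal_rates, beta, x_t):
    """Poisson case: scale each nominal rate by exp(beta_j . x_t)."""
    return [lam * math.exp(dot(b_j, x_t)) for lam, b_j in zip(nominal_rates, beta)]

def multinomial_probs(nominal_probs, beta, x_t):
    """Multinomial case: scale each nominal probability, then renormalize to sum to one."""
    weights = [p * math.exp(dot(b_j, x_t)) for p, b_j in zip(nominal_probs, beta)]
    total = sum(weights)
    return [w / total for w in weights]

# Stacked indicator vector for three sources with 4, 2 and 3 states (hypothetical states).
x_t = [0, 1, 0, 0, 1, 0, 0, 0, 1]
# Hypothetical weights: state 2 of the first source lowers feature 0 and raises feature 1.
beta = [
    [0.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
rates = poisson_rates([4.0, 2.0], beta, x_t)
probs = multinomial_probs([0.5, 0.5], beta, x_t)
```

Because the indicator vector selects one weight per source, the dot-product is the sum of the selected weights, and a zero weight leaves the nominal parameter unchanged.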
4.3 Derivation of the Algorithm (S110)
An exemplary method for evaluation of Eq[log p (y|x, β)] in the maximization problem (9) is set forth in this section.
4.3.1 Poisson Case
In the case of Poisson observations, the quantity Eq[log p (y|x, β)] can be computed exactly:
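The exact expression is not reproduced in this text. Under the fully factorized q, and writing $\gamma_{tk}=E_q[x_{tk}]$ for the posterior state proportions (a notation assumed here), a derivation from the Poisson log-probability gives the sketch below. It relies on the independence of the chains under q and on each $x_{tk}$ being an indicator vector, and it is offered as a reconstruction rather than as the original equation.

```latex
\[
E_q\big[\log p(y\mid x,\beta)\big]
= \sum_{t=1}^{T}\sum_{j=1}^{J}\Big[
   y_{tj}\Big(\log\lambda_{tj} + \sum_{k=1}^{K}\beta_{jk}^{\top}\gamma_{tk}\Big)
   \;-\; \lambda_{tj}\prod_{k=1}^{K}\Big(\sum_{l=1}^{L_k}\gamma_{tkl}\,e^{\beta_{jkl}}\Big)
   \;-\; \log y_{tj}! \Big]
\]
% Gradient with respect to one observational parameter, ignoring any prior on beta
\[
\frac{\partial}{\partial\beta_{jkl}}E_q\big[\log p(y\mid x,\beta)\big]
= \sum_{t=1}^{T}\gamma_{tkl}\Big[\,y_{tj} \;-\; \lambda_{tj}\,e^{\beta_{jkl}}
   \prod_{k'\neq k}\Big(\sum_{l'=1}^{L_{k'}}\gamma_{tk'l'}\,e^{\beta_{jk'l'}}\Big)\Big]
\]
```

The product over k replaces a sum over all joint state configurations, which is what keeps the computation linear in the number of dynamics. The second line is the kind of closed-form gradient that a conjugate gradient M-step could use.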
4.3.2 Multinomial Case
In the case of multinomial observations, the implicit dependency between observations due to the constraint on their sum makes the computation harder:
The expectations in the first sum are $E_q[\sum_{k=1}^{K}\beta_{jk}^{\top}x_{tk}\,y_{tj}]=y_{tj}\sum_{k=1}^{K}\beta_{jk}^{\top}\gamma_{tk}$, where γtk denotes the expectation of xtk under q. The expectations in the second sum have an exponential number of terms, but the fact that the logarithm is a concave function can be used to define a lower bound. A Taylor expansion of the logarithm at point nt/φt gives:
where the non-negative scalars φt are additional variable parameters. Their values will be optimized to maximize the lower bound:
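The bound itself is missing from this text. A sketch of the standard concavity argument is given below, under the assumptions that $n_t>0$ and $\varphi_t>0$, and with $z_t=\sum_{j}p_{tj}\exp(\beta_j^{\top}x_t)$ denoting the normalizer of the multinomial parameters (a symbol introduced here for illustration).

```latex
% First-order Taylor expansion of the concave logarithm at the point n_t / phi_t
\[
\log z_t \;\le\; \log\frac{n_t}{\varphi_t} \;+\; \frac{\varphi_t}{n_t}\,z_t \;-\; 1
\]
% Resulting lower bound on the normalizer term of the multinomial log-probability
\[
-\,n_t\,E_q[\log z_t] \;\ge\; n_t\log\varphi_t \;-\; n_t\log n_t \;+\; n_t
 \;-\; \varphi_t\sum_{j=1}^{J}p_{tj}\prod_{k=1}^{K}\Big(\sum_{l=1}^{L_k}\gamma_{tkl}\,e^{\beta_{jkl}}\Big)
\]
```

The last term has the same shape as the Poisson expectation with $\varphi_t\,p_{tj}$ in place of the nominal rate, which is consistent with the statement that the two cases can be cast into the same algorithm.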
4.3.3 The Algorithm
Referring back to section 4.1, note that the multinomial and the Poisson cases have very similar forms. In the following, both approaches are merged by considering that for every t, Δt=φt. For multinomial observations, the values φt are estimated by maximizing the lower bound, and for Poisson observations, they are given a priori. The lower bound has the following form:
This function is maximized using a coordinate ascent algorithm. If the terms depending on the k-th chain are isolated, the lower bound is:
A maximization of (13) relative to qk can be performed using the Forward-Backward algorithm for HMM. The values of the matrix Ok={Otkl} correspond to the log-probability of the t-th observation if xtk is in state l. All these steps are summarized in the predictive algorithm, which takes as input the observations y, the Markov chain parameters α, the weights β and the tolerance on the stopping rule, defined as a scalar ε>0.
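The per-chain step can be sketched with a scaled Forward-Backward pass over a single chain. This is a minimal illustration, not Algorithm 4.1: it assumes the matrix of log-probabilities for the chain has already been computed, and the function name, 0-based states, and example values are hypothetical.

```python
import math

def forward_backward(log_obs, initial, transition):
    """Posterior state marginals of one chain.

    log_obs[t][l] is the log-probability of observation t if the chain is in state l,
    initial[l] = p(x_1 = l), transition[l][m] = p(x_t = m | x_(t-1) = l).
    """
    T, L = len(log_obs), len(initial)
    # Shift each row by its maximum before exponentiating; a per-row scale cancels out.
    obs = [[math.exp(v - max(row)) for v in row] for row in log_obs]

    # Forward pass, normalized at every step to avoid underflow.
    alpha = []
    cur = [initial[l] * obs[0][l] for l in range(L)]
    for t in range(T):
        if t > 0:
            prev = alpha[-1]
            cur = [obs[t][m] * sum(prev[l] * transition[l][m] for l in range(L))
                   for m in range(L)]
        total = sum(cur)
        alpha.append([a / total for a in cur])

    # Backward pass, also normalized at every step.
    back = [[1.0] * L for _ in range(T)]
    for t in range(T - 2, -1, -1):
        raw = [sum(transition[l][m] * obs[t + 1][m] * back[t + 1][m] for m in range(L))
               for l in range(L)]
        total = sum(raw)
        back[t] = [b / total for b in raw]

    # Combine and renormalize to obtain the marginals at each time.
    gamma = []
    for t in range(T):
        g = [alpha[t][l] * back[t][l] for l in range(L)]
        total = sum(g)
        gamma.append([v / total for v in g])
    return gamma

# Hypothetical two-state chain observed over three time steps.
log_obs = [[0.0, -2.0], [-1.0, 0.0], [-3.0, 0.0]]
initial = [0.8, 0.2]
transition = [[0.9, 0.1], [0.3, 0.7]]
gamma = forward_backward(log_obs, initial, transition)
```

In a coordinate ascent loop, a pass of this kind would be repeated for each chain in turn, with the log-probability matrix recomputed from the current marginals of the other chains, until the bound changes by less than the tolerance.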
Without intending to limit the scope of the exemplary embodiments, the following example demonstrates the effectiveness of the exemplary algorithm.
In this application, usage logs are observed in a print infrastructure such as an office where multiple devices are offered for shared use to a set of users.
The observed variables are pairs (ut, dt)t=1, . . . , T, where ut denotes the user who invoked the t-th job (ranges over {1, . . . , Nu}) and dt denotes the device in that job (ranges over {1, . . . , Nd}).
It is assumed that the stochastic matrix πij of user print profiles is given, where πij is the nominal probability that user i will use device j prior to any knowledge about the state of the devices.
It is assumed that at each job t, each device j is in a state stj∈{1, . . . , Lj}, and that for each state l∈{1, . . . , Lj}, the confidence attributed by the users to a device known to be in state l is captured by a coefficient 0<αl≤1. More precisely, the probability that user i chooses device j knowing all the device states at the time of job t is obtained from the nominal probability of i choosing j (prior to the knowledge of the states) multiplied by the confidence coefficient attached to the state of j:
$p(d_t=j\mid u_t=i,s_t)\propto \alpha_{s_{tj}}\,\pi_{ij}$
the coefficient of the proportionality being chosen so that the probabilities sum to 1.
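The choice rule just described can be sketched in a few lines. The user profile, the two-state coding (1 for normal operation, 2 for a soft failure), and the confidence values are hypothetical and serve only to show how usage shifts away from a degraded device.

```python
def device_choice_probs(profile_row, confidence, device_states):
    """Probability that a user picks each device, given every device's current state.

    profile_row[j] is the user's nominal probability of using device j,
    confidence[l] is the coefficient (between 0 and 1) attached to state l,
    device_states[j] is the current state of device j.
    """
    weights = [pi * confidence[s] for pi, s in zip(profile_row, device_states)]
    total = sum(weights)  # proportionality constant so the probabilities sum to 1
    return [w / total for w in weights]

profile_row = [0.7, 0.2, 0.1]   # hypothetical nominal profile of one user over three devices
confidence = {1: 1.0, 2: 0.1}   # hypothetical: state 1 = normal, state 2 = soft failure

healthy = device_choice_probs(profile_row, confidence, [1, 1, 1])
degraded = device_choice_probs(profile_row, confidence, [2, 1, 1])
```

With all devices in the normal state the nominal profile is recovered; when the first device degrades, its share drops and the other devices absorb the difference, which is the usage signature the inference step looks for.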
There are J=Nd observed counts defined by $y_{tj}=I\{d_t=j\}$.
There are K=Nd+1 dynamics defined by:
$x_t=(s_{t1},\ldots,s_{tN_d},u_t)$
The first Nd components represent the hidden states of the devices. The last component represents the user chosen at each job. This introduces a slight variation with respect to the framework described above, where it was assumed, for simplicity's sake, that all the dynamics were hidden. It would have been straightforward to account for observed dynamics. In practice, in Algorithm 4.1, the componentwise maximization for the observed dynamics can be skipped.
With these assumptions, it can be seen that, upon one single job, the vector of job counts on the different devices given the device states and selected user follows a multinomial distribution:
which corresponds to Eqn. 6. Matching the parameters of the multinomial with Eqn. 5, then, for j=1, . . . , J:
Without intending to limit the scope of the exemplary embodiment,
The simulated results demonstrate that the inference problem (guessing the state of the devices by observing their usage) is an instance of the approach developed in the previous sections and that useful predictions can be made as to the state of a device over time.
5. Extensions
FHMM are not limited to chains but can also be defined for Gaussian processes (see e.g. A. Howard and T. Jebara. Dynamical systems trees. In Proc. of Uncertainty in Artificial Intelligence, 2004). The present focus on Markov chains is only due to our target applications, which deal with time-series of non differentiable observations (typically, failures). Discrete states for the hidden dynamics are assumed in the illustrated examples. A continuous dynamic can also be used, for example by approximating a continuous dynamic using Markov chains with multiple leveled states. In one embodiment, it is contemplated that the method may mix both continuous and discrete dynamics in a Factorial switching Kalman filter to efficiently model continuous processes, such as aging or graceful degradation of systems, together with discrete processes such as working or non-working operation modes.
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.
Number | Name | Date | Kind |
---|---|---|---|
7165029 | Nefian | Jan 2007 | B2 |
7209883 | Nefian | Apr 2007 | B2 |
7424464 | Oliver et al. | Sep 2008 | B2 |
20040181712 | Taniguchi et al. | Sep 2004 | A1 |
20040260548 | Attias et al. | Dec 2004 | A1 |
20050256713 | Garg et al. | Nov 2005 | A1 |
20060132826 | Ciriza et al. | Jun 2006 | A1 |
20060206445 | Andreoli et al. | Sep 2006 | A1 |
20070268509 | Andreoli et al. | Nov 2007 | A1 |
Number | Date | Country |
---|---|---|---|
20080300879 A1 | Dec 2008 | US |