1. Field of the Invention
The invention relates to the construction of Hidden Markov Models (HMMs), and in particular to a method for biasing an algorithm that constructs the HMMs by means of a similarity metric.
2. Description of the Related Art
Hidden Markov Models (HMMs) are a well-known method for modeling the probability distributions of sequences of data. An introduction to HMMs and to an algorithm used to construct HMMs from observations is presented in J. Bilmes, “A Gentle Tutorial On The EM Algorithm And Its Application To Parameter Estimation For Gaussian Mixture and Hidden Markov Models,” Technical Report ICSI-TR-97-021, International Computer Science Institute, Berkeley, Calif., 1997 (hereinafter referred to as Bilmes 97). Some of the fundamental concepts of Bilmes 97 are discussed below.
A probability distribution P over sequences of observations O1,O2, . . . is called a Markov Chain if P(Oi|Oi-1, . . . ,O1)=P(Oi|Oi-1), namely, if the ith symbol is conditionally independent of the symbols that occurred before the (i-1)st symbol, given the (i-1)st symbol. It is also common to say that the observations O1,O2, . . . form a Markov Chain.
A probability distribution P over sequences of observations O1,O2, . . . is called a Hidden Markov Chain if for every observation Oi there exists a hidden variable Xi, and a probability distribution Q over sequences of pairs (O1,X1),(O2,X2), . . . such that
The Xi's take values in a finite set;
P is the marginal of Q over the sequences of observations O1,O2, . . . , that is, the unique distribution over sequences of observations induced by Q;
The hidden variables X1,X2, . . . form a Markov Chain;
Ot is independent of all other variables given Xt. If Xt=i, the chain is said to be in state i at time t.
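By way of non-limiting illustration, the generative process implied by the above definition can be sketched in Python as follows. The names and signatures are illustrative only; the sketch assumes discrete observations, an initial distribution pi, a transition matrix A with Aij=Q(Xt=i|Xt-1=j) (as defined below), and an emission matrix B with B[i][o]=Q(Ot=o|Xt=i).

```python
import numpy as np

def sample_hmm(pi, A, B, T, rng=None):
    """Sample a length-T observation sequence from a discrete HMM.

    pi : initial state distribution, shape (num_states,)
    A  : transition matrix, A[i, j] = Q(X_t = i | X_{t-1} = j)
    B  : emission matrix, B[i, o] = Q(O_t = o | X_t = i)
    """
    if rng is None:
        rng = np.random.default_rng()
    observations, states = [], []
    x = rng.choice(len(pi), p=pi)           # draw X_1 from pi
    for _ in range(T):
        states.append(x)
        # O_t is independent of all other variables given X_t
        observations.append(rng.choice(B.shape[1], p=B[x]))
        x = rng.choice(len(pi), p=A[:, x])  # X_{t+1} depends only on X_t
    return observations, states
```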
The probability distribution P can be estimated from collections of observation sequences, called training sets, using an algorithm known as the Baum-Welch algorithm, which is fully described in Bilmes 97. The Baum-Welch algorithm actually estimates Q and derives P from the estimated Q. More specifically, Q is determined by:
The initial probability distribution π of X1, with πi=Q(X1=i);
The transition probability matrix A, whose entry in the ith row, jth column is Aij=Q(Xt=i|Xt-1=j)
The conditional probability distribution B of the observations given the corresponding hidden variables, with Bi(o)=Q(Ot=o|Xt=i).
The Baum-Welch algorithm alternates between two phases. In the first phase, known as the E-step, the sequences of observations and the values of π, B, and A computed during the previous iteration (or the initial guesses, during the first iteration) are used in the forward-backward procedure to compute the following quantities:
αi(t)=Q(O1=o1, . . . ,Ot=ot, Xt=i), the probability of observing the first t observations and of being in state i at time t,
βi(t)=Q(Ot+1=ot+1, . . . ,OT=oT|Xt=i), the probability of the observations from time t+1 through the end of the sequence of length T, given that the state at time t is i.
The above quantities and the values of π, B, and A computed during the previous iteration are further used to compute the following:
γi(t)=Q(Xt=i|O1=o1, . . . ,OT=oT), the conditional probability of being in state i at time t given the entire sequence of observations, and
ξij(t)=Q(Xt=i, Xt+1=j|O1=o1, . . . ,OT=oT), the conditional probability of being in state i at time t and in state j at time t+1, given the entire sequence of observations.
During the second phase, known as the M-step, given the quantities computed in the E-step, the values of π, B, and A are updated.
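As a non-limiting sketch, the E-Step quantities defined above can be computed as follows. The code assumes a discrete HMM, uses the convention Aij=Q(Xt=i|Xt-1=j) introduced above, and omits the numerical rescaling that practical implementations require for long sequences.

```python
import numpy as np

def e_step(pi, A, B, obs):
    """One E-Step of the Baum-Welch algorithm for a discrete HMM.

    Uses the conventions A[i, j] = Q(X_t = i | X_{t-1} = j) and
    B[i, o] = Q(O_t = o | X_t = i).  Unscaled, so suitable only for
    short sequences; practical code rescales alpha and beta.
    """
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_i(1)
    for t in range(1, T):
        alpha[t] = (A @ alpha[t - 1]) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A.T @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                              # proportional to gamma_i(t)
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((T - 1, S, S))
    for t in range(T - 1):
        # xi[t, i, j] proportional to alpha_i(t) A[j, i] B[j, o_{t+1}] beta_j(t+1)
        xi[t] = np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A.T
        xi[t] /= xi[t].sum()
    return alpha, beta, gamma, xi
```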
A special case of HMM, called the Input-Output HMM (IOHMM), is described in Y. Bengio and P. Frasconi, “Input-Output HMM's for Sequence Processing,” IEEE Trans. Neural Networks, 7(5):1231-1249, September 1996. In that case, each observation consists of an input-output pair, and the Baum-Welch algorithm is used to compute the conditional distribution of the outputs given the inputs. The IOHMM is implemented using a pair of classifiers for each of the states in the following model:
a transition classifier that embodies a conditional probability distribution over the states given the input and the state; and
an action classifier that embodies a conditional probability distribution over the outputs given the input and the state.
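The two-classifier structure described above may be sketched as follows; the class and attribute names are hypothetical, and any probabilistic classifier (logistic regression, for example) can play either role.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

# A "classifier" here is any callable mapping an input vector to a
# probability vector, as in the IOHMM description above.

@dataclass
class StateClassifiers:
    # conditional distribution over states given the input and this state
    transition: Callable[[np.ndarray], np.ndarray]
    # conditional distribution over outputs given the input and this state
    action: Callable[[np.ndarray], np.ndarray]

@dataclass
class IOHMM:
    pi: np.ndarray                       # initial state distribution
    classifiers: list[StateClassifiers]  # one classifier pair per state

    def step_distributions(self, state: int, u: np.ndarray):
        """Distributions over next states and outputs for input u."""
        c = self.classifiers[state]
        return c.transition(u), c.action(u)
```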
The Baum-Welch algorithm is computationally expensive. In general, it produces a triple of values π, B, and A that corresponds to a local maximum of the likelihood of the data, and there are, in general, multiple local maxima.
A need exists for a method for enhancing the Baum-Welch algorithm or, more generally, for inducing HMMs from data. Such an enhanced algorithm should have the advantage of producing an assignment of observations to states that can provide valuable insights into the structure of the source that generates the sequence of observations.
The present invention teaches a method for inducing Hidden Markov Models (HMMs) from data, based on a similarity function between symbols such that dissimilar symbols are associated with different hidden states, which has the advantage of making the task of induction less computationally expensive than existing methods. The inventive method has the further advantage of producing an assignment of observations to states that can be used to provide valuable insights into the structure of the source that generates the sequence of observations.
The present invention achieves the following benefits and improvements over the prior art:
1. A dramatic improvement in training time, achieved by assigning samples to states based on similarity.
2. An improvement over prior art that requires constructing multiple HMMs and selecting the one with the best performance: the similarity function enables a simple and effective method for selecting an initial model for the HMM without having to construct multiple HMMs.
3. Providing a natural way of initializing the HMM using representative observations and a distance function, rather than performing the commonly used random initialization or using more complex methods.
4. Improving on existing methods described in H. Singer and M. Ostendorf, “Maximum Likelihood Successive State Splitting,” Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pages 601-604, 1996, which require splitting each state and selecting the split that maximizes a predefined objective function. This is achieved by using the similarity function to provide a simple criterion for deciding when to add a new state to the model.
5. Allowing for extension of the inventive methods to different types of HMMs, including IOHMM, while retaining the advantages listed above.
The invention describes a method for inducing a Hidden Markov Model (HMM) using a plurality of training observations and a distance function, the method including: assigning at least one representative observation to each of a plurality of hidden states of the HMM; computing a distance between said at least one assigned representative observation and one of said training observations using the distance function, wherein said distance is computed for each assigned representative observation; selecting an initial number of hidden states using the computed distance; initializing the Baum-Welch algorithm using the computed distance; and computing at least one of the E-Step and M-Step of the Baum-Welch algorithm by incorporating said computed distance.
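By way of non-limiting illustration, the method summarized above may be outlined as follows; each callable argument is a placeholder for a step detailed in the embodiments below, and none of the names form part of the invention.

```python
def induce_hmm(training_sequences, d, kernel,
               select_representatives, initialize_baum_welch,
               e_step_with_similarity, update_representatives,
               m_step, converged):
    """Outline of the claimed method; each callable stands for a step
    detailed in the embodiments below."""
    # Assign at least one representative observation per hidden state;
    # the number of representatives fixes the initial number of states.
    reps = select_representatives(training_sequences, d)
    # Initialize the Baum-Welch algorithm from the computed distances.
    model = initialize_baum_welch(training_sequences, reps, d, kernel)
    # Iterate Baum-Welch, incorporating the distances into the E-Step
    # and, through it, the M-Step, until convergence.
    while not converged(model):
        stats = e_step_with_similarity(model, training_sequences,
                                       reps, d, kernel)
        reps = update_representatives(stats, training_sequences)
        model = m_step(model, stats)
    return model
```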
The foregoing and other objects, aspects, and advantages of the present invention will be better understood from the following detailed description of preferred embodiments of the invention with reference to the accompanying drawings.
Preferred embodiments of the present invention will now be described in detail herein below with reference to the annexed drawings. In the drawings, the same or similar elements are denoted by the same reference numerals even though they are depicted in different drawings. In the following description, a detailed description of known functions and configurations incorporated herein has been omitted for conciseness.
For the present application, the following definitions are proposed:
(O11, . . . ,O1T1), . . . , (ON1, . . . ,ONTN) is a collection of N sequences of observations, where the observations take values in a set O. It is desired to use this collection of sequences to induce a Hidden Markov Model (HMM) describing their distribution. This collection is called a training set, each sequence is called a training sequence, and each observation is called a training observation.
d(o1, o2) is a distance between pairs of elements of O. The present invention teaches a method for inducing an HMM using the distance function d having the following property:
The distance function d is selected by an expert and is tailored to requirements of the specific application domain where the method is applied.
The number of hidden states may be given, as commonly taught in the art. Alternatively, the number of hidden states may be determined by an appropriate method, for example a method described in A. Barron, J. Rissanen, and B. Yu, “The Minimum Description Length Principle in Coding and Modeling,” IEEE Transactions on Information Theory, 44(6):2743-2760, October 1998 (hereinafter referred to as Barron 98).
In Step 104, the E-Step quantities are computed, as described above in the background section of this application. In Step 105, the representative observations are updated. In Step 106, the M-Step quantities are computed by incorporating the similarities computed in Step 103, either directly or through the E-Step quantities of Step 104 that are used to compute the M-Step quantities. The manner in which Step 104 incorporates the similarities computed in Step 103 into the computation of the E-Step quantities is novel and is not taught in the prior art. In Step 107, it is determined whether the algorithm has converged. If the algorithm has converged, it terminates in Step 108; otherwise, the computation continues from Step 103.
In one preferred embodiment of the present invention, the selection of representative observations performed in Step 101 includes randomly assigning a representative observation to each hidden state. In another preferred embodiment, described below, the representative observations, and with them the number of hidden states, are selected incrementally from the training observations.
In that embodiment, the training observations are iterated over in Step 203, and in Step 204 the distance between the current training observation and the representative observation of each existing state is computed.
In Step 205, the smallest distance computed in Step 204 is compared to a threshold T. In one preferred embodiment, the threshold T may be selected by the user. In another embodiment, where distances are computed from similarities, the threshold is set to 0 to denote complete dissimilarity. In a third embodiment, where distances are computed from dissimilarities, the threshold is set to 0 to denote complete similarity. If, in Step 205, the minimum distance is not greater than the threshold, the process returns to Step 203. Otherwise, if the minimum distance is greater than the threshold, a new state is added to the state space in Step 206, the current training observation is set as the representative observation of the new state in Step 207, and the processing returns to Step 203.
Those of ordinary skill in the art would appreciate that, in accordance with the current invention, the selection of representative observations may combine the incremental method of Steps 203 through 207 with the random assignment of Step 101 described above.
The prior art (see Barron 98) teaches selecting the initial number of states by fixing a collection of possible values for the number of states, performing the Baum-Welch algorithm with each of the values in the collection, and comparing the resulting HMMs. The method described above, by contrast, selects the initial number of states directly from the training observations and the distance function, without constructing and comparing multiple HMMs.
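A minimal sketch of the incremental selection of Steps 203 through 207, assuming a user-supplied distance function d and threshold, is given below; the names are illustrative only.

```python
def select_representatives(training_sequences, d, threshold):
    """Incrementally select representative observations (Steps 203-207).

    A new hidden state is created whenever the current training
    observation is farther than `threshold` from the representative
    observation of every existing state.
    """
    representatives = []
    for sequence in training_sequences:
        for obs in sequence:                  # Steps 203-204
            if not representatives:
                representatives.append(obs)   # first observation seeds a state
                continue
            min_dist = min(d(obs, r) for r in representatives)
            if min_dist > threshold:          # Step 205: dissimilar to all states
                representatives.append(obs)   # Steps 206-207: new state, new rep
    return representatives
```

The number of representatives so selected is the initial number of hidden states.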
In a preferred embodiment of the present invention, Step 103 includes computing the distance, using the distance function d, between each training observation and the representative observation of each hidden state.
In another embodiment, described below, the Baum-Welch algorithm is initialized as follows: a training sequence is selected in Step 302 and a training observation is selected in Step 303, and in Step 304 the quantity γx(t) is computed for each state x according to Equation (1):
γx(t)=K(d(Ox,Ot))/ΣzK(d(Oz,Ot))  (1)
where Ot is the observation selected in Step 303, i.e., observation number t of the sequence, Ox is the representative observation of state x, Oz is the representative observation of state z, and the sum in the denominator is over all states z in the state space.
In Step 305, ξij is computed so as to be consistent with the values of γx computed in Step 304. Finally, in Step 306, the initial values of π, B, and A are computed from γx and ξij as taught in the art, and the process repeats starting at Step 302. In a preferred embodiment, the Step 305 computation consists of setting ξij according to Equation (2):
ξij(t)=γi(t−1)γj(t) (2)
which is consistent with the values of γx computed in Step 304, because the quantities satisfy the following Equation (3):
Σiξij(t)=γj(t)Σiγi(t−1)=γj(t)  (3)
namely, the desired consistency requirement.
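As a non-limiting sketch, the initialization of Steps 302 through 305 may be written as follows, assuming a distance function d and a finite-support kernel K; gamma implements Equation (1) and xi implements Equation (2).

```python
import numpy as np

def initial_gamma_xi(sequence, representatives, d, kernel):
    """Initialize the Baum-Welch quantities for one training sequence.

    gamma[t, x] = K(d(O_x, O_t)) / sum_z K(d(O_z, O_t))   -- Equation (1)
    xi[t, i, j] = gamma[t, i] * gamma[t + 1, j]           -- Equation (2)
    """
    gamma = np.array([[kernel(d(r, o)) for r in representatives]
                      for o in sequence])
    gamma /= gamma.sum(axis=1, keepdims=True)            # denominator of Eq. (1)
    xi = np.einsum('ti,tj->tij', gamma[:-1], gamma[1:])  # Equation (2)
    return gamma, xi
```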
In a preferred embodiment, Step 104 includes incorporating the distances computed in Step 103 into the computation of the E-step quantities, as follows: compute the γj(t) as described in the art; then multiply each γj(t) by the quantity in Equation (4) (which is equivalent to Equation (1)):
K(d(Oj,Ot))/ΣzK(d(Oz,Ot))  (4)
and finally normalize the collection γ1(t), . . . ,γX(t), where X is the number of states in the state space.
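A non-limiting sketch of this biased E-Step computation follows. Multiplying by the raw kernel values and then renormalizing is equivalent to multiplying by Equation (4), because the denominator of Equation (4) does not depend on the state j.

```python
import numpy as np

def bias_gamma(gamma, sequence, representatives, d, kernel):
    """Incorporate the Step 103 distances into the E-Step quantities.

    gamma : (T, S) array of the standard E-Step values gamma_j(t)
    """
    sim = np.array([[kernel(d(r, o)) for r in representatives]
                    for o in sequence])         # numerator of Equation (4)
    biased = gamma * sim                        # multiply each gamma_j(t)
    biased /= biased.sum(axis=1, keepdims=True) # normalize over the states
    return biased
```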
In the preferred embodiment, Step 105 (the updating of the representative observations) is performed as follows. In Step 401, a hidden state is selected; Steps 402 through 406 are then repeated for each hidden state. In Step 402, a training sequence is selected.
In Step 403, the process iterates over the training observations of the training sequence selected in Step 402. Step 404 is repeated for each training observation of each training sequence and consists of computing the conditional probability γ of being in the hidden state selected in Step 401. In Step 405, the computed conditional probability γ is compared to the largest value obtained in the previous iterations of Steps 402 and 403 during the current iteration of Step 401. If the value γ computed in Step 404 is larger than the largest value computed so far, the value γ is remembered together with its corresponding observation. Once all iterations of Steps 402 and 403 for the current iteration of Step 401 are concluded, in Step 406, the observation with the largest value of γ identified in Step 405 is assigned as the representative observation of the state selected in Step 401.
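Steps 401 through 406 can be sketched as follows, assuming the per-sequence E-Step values of γ are available; the names are illustrative only.

```python
import numpy as np

def update_representatives(gammas, training_sequences, num_states):
    """For each hidden state, select the training observation with the
    largest conditional probability gamma of being in that state.

    gammas : one (T_n, S) array of gamma values per training sequence
    """
    best = [(-np.inf, None)] * num_states
    for g, sequence in zip(gammas, training_sequences):   # Steps 402-403
        for t, obs in enumerate(sequence):
            for x in range(num_states):                   # Step 404
                if g[t, x] > best[x][0]:                  # Step 405
                    best[x] = (g[t, x], obs)
    return [obs for _, obs in best]                       # Step 406
```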
It would be apparent to those of ordinary skill in the art that Steps 105 and 106 of the method described above may be implemented in a variety of equivalent ways without departing from the scope of the invention.
In another embodiment of the present invention, each hidden state includes multiple representative observations. For example, the methods described above can be extended so that, in updating the representatives, the k training observations with the largest values of γ for a given hidden state are retained as its representative observations.
In the preferred embodiment, each representative observation of a hidden state has an associated score that represents how strongly the observation and the hidden state are associated. For example, those of ordinary skill in the art would appreciate that the values of γ associated with the k selected observations can be used as scores. All the algorithms taught in this application for the case where each hidden state has a unique representative observation can then be extended to k representative observations by substituting weighted distances for distances; hence, the dissimilarity between a state x and an observation O is defined in Equation (5) as:
D(x,O)=Σisxid(Oxi,O) (5)
where Oxi is the ith representative observation of state x, sxi is its normalized score, and the sum is taken over the representative observations of state x. In another embodiment, the quantity K(d(Ox,O)) may be replaced by Equation (6):
K(x,O)=ΣisxiK(d(Oxi,O)) (6)
Extension of the methods taught in the present application to either embodiment will be understood by those of ordinary skill in the art.
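By way of illustration, Equations (5) and (6) may be sketched as follows, assuming each state carries a list of (representative observation, normalized score) pairs; the names are illustrative only.

```python
def weighted_distance(reps_and_scores, d, obs):
    """Equation (5): D(x, O) = sum_i s_xi * d(O_xi, O)."""
    return sum(s * d(rep, obs) for rep, s in reps_and_scores)

def weighted_kernel(reps_and_scores, d, kernel, obs):
    """Equation (6): K(x, O) = sum_i s_xi * K(d(O_xi, O))."""
    return sum(s * kernel(d(rep, obs)) for rep, s in reps_and_scores)
```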
Steps 505 to 508 are repeated for each observation selected in Step 504. In Step 505, the distance between the observation selected in Step 504 and the representative observations of the existing states is computed. In Step 506, the minimum of these distances is compared to a threshold. If the minimum distance is at or below the threshold, the process returns to Step 504 for selection of the next observation in the trace. Otherwise, the process continues to Step 507, where a new state is added to the HMM. In Step 508, the current observation is set as the representative of the new state, and the alignment of the sequences is recomputed to account for the new state.
In a preferred embodiment, the HMM is an IOHMM, and the enhancement of the IOHMM using the distance function is called a SimIOHMM, where the prefix “Sim” stands for “Similarity”. Structurally, the SimIOHMM includes the IOHMM, a collection of representative inputs, a distance function between inputs d(.,.), and a finite-support kernel function K(.). Representative inputs, distance, and kernel are incorporated in the Baum-Welch algorithm. Formally, the input-output pair (U,Y) can be substituted for the observation O in the classical Baum-Welch algorithm to obtain the Baum-Welch algorithm for the SimIOHMM.
The E-Step is then formally analogous to that of the classical HMM. The M-Step updates three sets of quantities: the initial probability distribution π on the states; the state-transition probability distribution A, which in the IOHMM is a mapping from the state space times the input space to the set of probability distributions over the state space; and the conditional probability distribution B of the observations given the states. The chain rule for probability yields the following Equation (7):
bx(u,y)=Q(Y=y|X=x, U=u)Q(U=u|X=x) (7)
Both the IOHMM and the SimIOHMM estimate π using the standard approach, A with the transition classifier learning algorithm, and Q(Y=y|X=x, U=u) with the action classifier learning algorithm.
While the IOHMM ignores the term Q(U=u|X=x), the SimIOHMM in the preferred embodiment of the present invention estimates it using Bayes' rule to expand it as follows in Equation (8):
Q(U=u|X=x)=Q(X=x|U=u)Q(U=u)/Q(X=x) (8)
The third term of the above equation, Q(X=x), is estimated from the γx(t), the conditional probabilities of being in state x at time t given the observed values of the sequence, computed in the previously performed E-Step. The term Q(U=u) can be estimated from the data using the Maximum Likelihood Estimator or other estimators familiar to those skilled in the art. The distinguishing characteristic of the SimIOHMM described in the preferred embodiment of the present invention is the estimation of the first term, Q(X=x|U=u), performed by computing the distances between u and the representative input of each state, using these distances as inputs to the kernel function, and combining the results with the Nadaraya-Watson kernel density estimator as follows in Equation (9):
Q(X=x|U=u)=K(d(u,Ux))/ΣzK(d(u,Uz)) (9)
where Ux is the representative input of state x, and the sum in the denominator is over all states z. Other methods of converting similarity between inputs into probabilities can be used in conjunction with the SimIOHMM.
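A non-limiting sketch of the Equation (9) estimate, assuming a list of representative inputs indexed by state, a distance function d, and a kernel K:

```python
def q_state_given_input(u, representative_inputs, d, kernel):
    """Nadaraya-Watson estimate of Q(X = x | U = u) -- Equation (9).

    representative_inputs[x] is the representative input U_x of state x.
    """
    weights = [kernel(d(u, ux)) for ux in representative_inputs]
    total = sum(weights)
    return [w / total for w in weights]
```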
A variation of the preferred embodiment, in which the SimIOHMM uses a distance function defined over both inputs and outputs and the representative samples are input-output pairs, would be understood by those of ordinary skill in the art and thus will not be described herein.
The methods of the preferred embodiment of the present invention are used to capture procedural knowledge by performing alignment and generalization of observed procedure traces. Other methods described herein are used to construct autonomic advisors using demonstrations and by associating scores with training sequences. These scores may apply to an entire sequence or to individual observations, and the methods described herein can be extended to the case of scored sequences of observations.
While the invention has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.