The following relates to the information processing, clustering, density estimation, and related arts.
Two common tasks in information processing are clustering of a set of N objects into K clusters, and density estimation.
In clustering, one has a group of objects each characterized by a set of features (for example, suitably represented as a feature vector), and it is desired to divide the objects into K different groups, classes, or clusters. In some approaches, the clustering problem is represented as an optimization problem, in which a log-likelihood function of the form:
$$\Theta=\sum_{n=1}^{N}\log\left(\sum_{k=1}^{K}w_k\,p_{k,n}\right)\qquad(1)$$
is maximized with respect to the weight parameters wk, k=1, . . . , K, subject to the limits:
$$w_k\ge 0\quad\forall k=1,\ldots,K\qquad(2),$$
and further subject to the normalization condition:
$$\sum_{k=1}^{K}w_k=1\qquad(3).$$
A log-likelihood function such as that of Equation (1) is known to be concave in the weight parameters wk, and the constraints of Equations (2) and (3) define a convex feasible set; hence the whole optimization problem maximizing (1) under the constraints (2) and (3) is a convex optimization problem. Therefore, every local maximum is also a global maximum (and the maximizer is unique when the function is strictly concave), which simplifies maximization by avoiding problematic local (that is, non-global) maxima. Moreover, some optimization problems formulated as log-likelihood function maximization can be configured to be sparse, meaning that only a small number of the wk parameters are non-zero, a condition which promotes computational efficiency.
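The objective and its constraints can be sketched in a few lines of Python. This is a minimal illustration, assuming the mixture form Θ=Σn log(Σk wk pk,n) for Equation (1) and a list-of-lists layout `p[k][n]`; the function names and toy values are hypothetical and not taken from the disclosure.

```python
import math

def log_likelihood(w, p):
    """Assumed form of Equation (1): sum over n of log(sum over k of w[k] * p[k][n])."""
    n_objects = len(p[0])
    total = 0.0
    for n in range(n_objects):
        mix = sum(w_k * p_k[n] for w_k, p_k in zip(w, p))
        total += math.log(mix)
    return total

def is_feasible(w, tol=1e-9):
    """Constraints (2) and (3): non-negative weights that sum to one."""
    return all(w_k >= 0.0 for w_k in w) and abs(sum(w) - 1.0) <= tol

# Toy data (illustrative values only): K=2 candidates, N=2 objects.
p = [[0.9, 0.1],
     [0.2, 0.8]]
w_a, w_b, w_mid = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]

# Concavity implies the midpoint scores at least the average of the endpoints.
midpoint_value = log_likelihood(w_mid, p)
endpoint_average = 0.5 * (log_likelihood(w_a, p) + log_likelihood(w_b, p))
```

Comparing `midpoint_value` with `endpoint_average` is only a spot check of concavity along one segment of the feasible set, not a proof.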
In a clustering application, the index n=1, . . . ,N indexes N objects in a dataset and the index k=1, . . . ,K indexes K candidate cluster centroids. The candidate cluster centroids may be a subset of the objects to be clustered (K<N), the whole set of objects to be clustered (K=N), a disjoint set of objects, or a mix of the objects to be clustered and of objects belonging to a disjoint set. The parameters pk,n represent the probability that the nth object has been generated by the kth cluster. For example, in one generic formulation pk,n∝exp(−γ∥on−ck∥²) may be suitable, where on represents the location of the nth object in a vector space (for example, the features vector space), ck represents the location of the kth candidate cluster centroid in the vector space, ∥. . . ∥ represents a suitable distance measure in the vector space, and γ is a non-negative parameter. In a clustering application, K different candidate clusters ck are defined and the log-likelihood function Θ of Equation (1) is maximized respective to the weight parameters wk, k=1, . . . ,K. Once the optimal wk, k=1, . . . ,K have been identified, the clusters for which the weight parameters wk are strictly positive numbers are well-identified clusters, whereas if wk=0, the kth cluster is discarded from the set of candidate clusters. Each object indexed by i, i=1, . . . ,N is assigned (in a probabilistic sense) to one or more of the clusters k=1, . . . ,K using the formula
such that the objects are optimally distributed amongst the clusters.
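A minimal sketch of the affinity computation and the probabilistic assignment follows. It assumes a squared Euclidean distance for ∥. . . ∥ and assumes the assignment is the usual posterior wk pk,i / Σl wl pl,i; the assignment formula itself is not reproduced in this text, so that choice is an assumption. Function names and toy coordinates are hypothetical.

```python
import math

def affinities(objects, centroids, gamma):
    """p[k][n] = exp(-gamma * ||o_n - c_k||^2), left unnormalized (assumed distance)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(o, c)))
             for o in objects]
            for c in centroids]

def soft_assignments(w, p):
    """Assumed posterior: r[k][n] = w[k] * p[k][n] / sum over l of w[l] * p[l][n]."""
    n_clusters, n_objects = len(w), len(p[0])
    r = [[0.0] * n_objects for _ in range(n_clusters)]
    for n in range(n_objects):
        mix = sum(w[k] * p[k][n] for k in range(n_clusters))
        for k in range(n_clusters):
            r[k][n] = w[k] * p[k][n] / mix
    return r

# Exemplar-style setup (K = N): every object is also a candidate centroid.
objects = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
p = affinities(objects, objects, gamma=1.0)
w = [0.5, 0.0, 0.5]            # illustrative weights; candidate 1 was discarded
r = soft_assignments(w, p)     # column n holds the cluster memberships of object n
```

A candidate with wk=0 receives zero membership for every object, which matches the statement that such a cluster is discarded.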
Density estimation is an application closely related to clustering. In density estimation, it is desired to estimate a Probability Density Function (PDF) that is representative of the distribution of a group of objects or data points. In some density estimation approaches, the PDF is represented as a linear combination of K constituent functions. In these approaches, a log-likelihood function such as of the form given in Equation (1) is again used, but here with the interpretation that the parameters pk,n represent the degree to which the nth object or data point lies within the kth PDF component, and the weight parameters wk,k=1, . . . ,K are the relative weights of the K constituent PDF components in the linear combination. By maximizing the log-likelihood function Θ of Equation (1) respective to the weight parameters wk,k=1, . . . ,K, the PDF defined by the linear combination is optimized to best represent the distribution of the N objects or data points.
While clustering and density estimation are two useful applications of the log-likelihood function Θ of Equation (1), numerous other applications exist. For example, log-likelihood functions find application in information entropy-related problems, maximum likelihood problems, and so forth.
Accordingly, there is substantial technological value in developing computationally efficient methods for maximizing log-likelihood functions. A commonplace approach for maximizing a log-likelihood function is the iterative expectation-maximization (EM) algorithm. However, the speed of convergence of EM for log-likelihood maximization is relatively slow. Convergence speed can be enhanced by setting to zero any wk falling below a selected threshold (such as below 10⁻³/N). See, e.g., Lashkari et al., “Convex clustering with exemplar-based models”, NIPS (2007) (available at http://people.csail.mit.edu/polina/papers/LashkariGolland_NIPS07.pdf, last accessed Aug. 14, 2008), which is incorporated herein by reference in its entirety. However, the EM convergence is still relatively slow even with this enhancement. Other approaches for log-likelihood function maximization include various least-squares optimization techniques such as gradient-based approaches. However, these techniques typically also suffer from various deficiencies such as slow convergence, computational complexity, or so forth when applied to log-likelihood maximization.
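For comparison, the EM baseline for this problem reduces to a multiplicative update of the weights. The sketch below assumes the mixture form of Equation (1) with fixed pk,n. The pruning threshold mirrors the 10⁻³/N example, while renormalizing after pruning is an assumption of this sketch rather than something stated here. All names are hypothetical.

```python
import math

def log_likelihood(w, p):
    """Assumed form of Equation (1)."""
    return sum(math.log(sum(w[k] * p[k][n] for k in range(len(w))))
               for n in range(len(p[0])))

def em_step(w, p, prune_below=0.0):
    """One EM update of the weights; weights below `prune_below` are set to zero."""
    n_clusters, n_objects = len(w), len(p[0])
    mix = [sum(w[k] * p[k][n] for k in range(n_clusters)) for n in range(n_objects)]
    # w_k <- (1/N) * sum over n of the posterior of cluster k for object n.
    new_w = [w[k] / n_objects * sum(p[k][n] / mix[n] for n in range(n_objects))
             for k in range(n_clusters)]
    new_w = [0.0 if v < prune_below else v for v in new_w]
    total = sum(new_w)                 # renormalize (assumption of this sketch)
    return [v / total for v in new_w]

# Illustrative values only: K=2 candidates, N=3 objects.
p = [[0.9, 0.1, 0.8],
     [0.2, 0.8, 0.3]]
w0 = [0.5, 0.5]
w1 = em_step(w0, p)
pruned = em_step([0.999999, 0.000001], p, prune_below=1e-3 / 3)
```

Each unpruned EM step cannot decrease the log-likelihood, but every step touches all K weights and all N objects, which is one reason convergence can be slow.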
In some illustrative embodiments disclosed as illustrative examples herein, a method performed by an electronic processing device is disclosed, the method comprising: selecting a pair of parameters from a plurality of adjustable parameters of a concave log-likelihood function; maximizing a value of the concave log-likelihood function respective to an adjustment value to generate an optimal adjustment value, wherein the value of one member of the selected pair of parameters is increased by the adjustment value and the value of the other member of the selected pair of parameters is decreased by the adjustment value; updating values of the plurality of adjustable parameters by increasing the value of the one member of the selected pair of parameters by the optimized adjustment value and decreasing the value of the other member of the selected pair of parameters by the optimized adjustment value; and repeating the selecting, maximizing, and updating for different pairs of parameters to identify optimized values of the plurality of adjustable parameters.
In some illustrative embodiments disclosed as illustrative examples herein, a storage medium is disclosed that stores instructions executable by an electronic processing device to perform a method comprising: selecting a pair of parameters wi,wj from a set of K adjustable parameters of a log-likelihood function having the form
$$\Theta=\sum_{n=1}^{N}\log\left(\sum_{k=1}^{K}w_k\,p_{k,n}\right);$$
maximizing a value of the log-likelihood function incorporating a change δ to the selected pair of parameters wi,wj of the form
$$\Theta(\delta)=\sum_{n=1}^{N}\log\left(\sum_{k=1}^{K}w_k\,p_{k,n}+\delta\,(p_{i,n}-p_{j,n})\right)$$
respective to the parameter δ, where −wi≦δ≦wj, to generate an optimal value for the change δ; replacing (wi)new←(wi)old+δ and (wj)new←(wj)old−δ; and repeating the selecting, maximizing, and replacing for different pairs of parameters wi, wj of the set of K adjustable parameters to identify optimized parameter values for the set of K adjustable parameters.
In some illustrative embodiments disclosed as illustrative examples herein, a system is disclosed, comprising one or more electronic processors configured to perform a concave log-likelihood function maximization process defined by the following operations: selecting a pair of parameters from a plurality of adjustable parameters of a concave log-likelihood function; maximizing a value of the concave log-likelihood function respective to an adjustment value to generate an optimal adjustment value, wherein the value of one member of the selected pair of parameters is increased by the adjustment value and the value of the other member of the selected pair of parameters is decreased by the adjustment value; updating values of the plurality of adjustable parameters by increasing the value of the one member of the selected pair of parameters by the optimized adjustment value and decreasing the value of the other member of the selected pair of parameters by the optimized adjustment value; and repeating the selecting, maximizing, and updating for different pairs of parameters to identify optimized values of the plurality of adjustable parameters; and further configured to perform a task comprising clustering or generating a probability density function representative of a set of objects or data points, the task being performed by (i) generating a task-representative concave log-likelihood function, (ii) invoking the concave log-likelihood function maximization process respective to the task-representative concave log-likelihood function, and (iii) based on the maximized concave log-likelihood function associating the objects or data points with clusters or generating the probability density function representative of the set of objects or data points.
As used herein, the term “log-likelihood function” is intended to encompass any function embodying the logarithm of a likelihood of the form given in Equation (1), but is not intended to be limited to a probabilistic interpretation; it also encompasses formulations which do not involve likelihoods or probabilities. The following relates to any mathematical formulation that has the form of Equation (1) or trivial transformations or variations of it. For example, the log-likelihood function can be a variant of the function of Equation (1), such as:
Or when the normalization constraints (2) and (3) are removed using a softmax transformation:
The index k that indexes the w parameters is optionally multidimensional, for example in the following illustrative log-likelihood function:
The log-likelihood functions set forth in Equations (1) and (4)-(6) are illustrative examples, and the term “log-likelihood function” as used herein is intended to encompass all these variants set forth herein as well as other variant log-likelihood formulations. The illustrative log-likelihood maximization techniques disclosed herein are disclosed with respect to the illustrative log-likelihood function of Equation (1); however, the skilled artisan can readily adapt these techniques to any of the log-likelihood functions of Equations (4)-(6). Each of the log-likelihood functions of Equations (4)-(6) is of the same form as the log-likelihood function of Equation (1), and differs only insubstantially from Equation (1) in terms of the choice of parameter indexing (or lack thereof), and/or the choice of normalization factor (or lack thereof), and/or so forth.
The log-likelihood functions described herein as illustrative examples are concave log-likelihood functions. A concave log-likelihood function has adjustable parameters, is representative of the dependence of a likelihood upon those adjustable parameters, and has a single global maximum within a domain of interest of the adjustable parameters. The disclosed techniques for maximizing illustrative log-likelihood functions are expected to be generally applicable to maximizing any concave log-likelihood function respective to its adjustable parameters. The terms “maximization of the log-likelihood function”, “log-likelihood maximization”, and similar phraseology as used herein are intended to denote maximization of a log-likelihood function such as any of those of Equations (1) or (4)-(6) under the constraints on the w parameters set forth in Equations (2) and (3). Additional constraints or restrictions can be applied during the maximization, such as setting an upper limit on the w parameters, or constraining some w parameters to predetermined fixed values such as may be dictated by a priori information regarding the problem whose solution entails the log-likelihood maximization.
With reference to
As used herein, the terms “optimize”, “maximize”, and similar phraseology are intended to be broadly construed to encompass not only an absolute optimum or an absolute maximum, but also a value that is close to, but not precisely, the global optimum or maximum. For example, an iterative process may be used to optimize the log-likelihood function respective to the parameters wk. In doing so, the iterative algorithm may be terminated based on stopping criteria that cause the algorithm to stop the optimization at a point at which the log-likelihood function is not yet at the absolute global maximum. Such optimization is said to optimize the log-likelihood function respective to the parameters wk, even though the final value of the log-likelihood function may not be the absolute largest value attainable by adjustment of the parameters wk.
The disclosed log-likelihood maximization techniques employ maximization of the log-likelihood function respective to successive selected pairs of the adjustable parameters. The inventors have found that this approach provides substantially improved convergence times as compared with existing computationally intensive techniques such as expectation-maximization (EM), while providing comparable performance in terms of identifying a set of the adjustable parameters that maximizes the log-likelihood function.
The disclosed techniques of maximization of the log-likelihood function respective to successive selected pairs of the adjustable parameters are more generally applicable to maximization of any concave log-likelihood function. That is, any concave log-likelihood function is expected to be maximized in a computationally efficient manner by performing the maximization respective to successive selected pairs of the adjustable parameters. For convenience, the maximization technique is described with reference to the log-likelihood function of Equation (1), but is expected to be generally applicable for any concave log-likelihood function.
With continuing reference to
Moreover, the constraint of Equation (2), namely wk≧0∀k=1, . . . ,K, further limits the possible range of values of the adjustment value δ. The requirement wi+δ≧0 requires −wi≦δ. The requirement wj−δ≧0 requires that δ≦wj. Combining these two conditions produces the following bounds for the adjustment value δ:
−wi≦δ≦wj (13).
Optionally, further bounds may be imposed on the adjustment value δ, based on further constraints that the problem being solved may impose on the adjustable parameters wk.
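The bounds of Equation (13) and the paired update can be illustrated with a short sketch. The helper names are hypothetical; the only facts relied on are that wi becomes wi+δ, wj becomes wj−δ, and −wi≦δ≦wj.

```python
def delta_bounds(w, i, j):
    """Equation (13): delta must keep both w[i] + delta and w[j] - delta non-negative."""
    return -w[i], w[j]

def clip_delta(delta, w, i, j):
    """Project a proposed adjustment onto the admissible interval."""
    lower, upper = delta_bounds(w, i, j)
    return min(max(delta, lower), upper)

def apply_delta(w, i, j, delta):
    """Return updated weights; the sum is preserved because +delta and -delta cancel."""
    updated = list(w)
    updated[i] += delta
    updated[j] -= delta
    return updated

w = [0.2, 0.3, 0.5]
too_large = apply_delta(w, 0, 2, clip_delta(0.9, w, 0, 2))    # clipped to +0.5, zeroes w[2]
too_small = apply_delta(w, 0, 2, clip_delta(-1.0, w, 0, 2))   # clipped to -0.2, zeroes w[0]
```

Hitting either bound drives one member of the pair exactly to zero, which is how the pairwise scheme can produce sparse solutions.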
With continuing reference to
which can be further rearranged as follows:
The optimizer 24 suitably maximizes the adjusted log-likelihood function of Equation (15) respective to the adjustment value δ, subject at least to the constraint −wi≦δ≦wj, and without adjusting any of the adjustable parameters wk,k=1, . . . ,K.
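As a hedged illustration of why a single scalar suffices, the substitution can be written out under the assumption that the log-likelihood has the mixture form Θ=Σn log(Σk wk pk,n). The shorthand s_n and d_n is introduced here for illustration only and is not notation from the original equations.

```latex
\begin{aligned}
s_n &= \sum_{k=1}^{K} w_k\,p_{k,n}, \qquad d_n = p_{i,n}-p_{j,n},\\
\Theta(\delta) &= \sum_{n=1}^{N}\log\Bigl(\sum_{k\neq i,j} w_k\,p_{k,n}
   + (w_i+\delta)\,p_{i,n} + (w_j-\delta)\,p_{j,n}\Bigr)
   = \sum_{n=1}^{N}\log\bigl(s_n+\delta\,d_n\bigr),\\
\frac{\partial\Theta}{\partial\delta} &= \sum_{n=1}^{N}\frac{d_n}{s_n+\delta\,d_n},
\qquad
\frac{\partial^{2}\Theta}{\partial\delta^{2}} = -\sum_{n=1}^{N}\frac{d_n^{2}}{(s_n+\delta\,d_n)^{2}}\le 0 .
\end{aligned}
```

The non-positive second derivative shows that the one-dimensional problem is itself concave in δ, and the quantities s_n need to be computed only once per selected pair.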
It is believed that there is no closed form solution for the adjustment value δ that maximizes the adjusted log-likelihood function of Equation (15). However, the problem is a one-dimensional maximization problem that has both lower and upper bounds imposed by Equation (13). Accordingly, substantially any maximization algorithm can be used, such as a gradient ascent method. In some embodiments, an iterative Newton-Raphson algorithm is employed by the optimizer 24 to maximize the adjusted log-likelihood function of Equation (15) respective to the adjustment value δ. In this approach, the following iterations are performed (where the index t denotes the iterations, with δt being the current value and δt+1 being the updated value):
where:
The iterative Newton-Raphson algorithm can suitably be initiated by setting the initial value δ0=0. Various termination criteria can be employed. In one suitable approach, the iterative Newton-Raphson maximization is terminated when any of three conditions is met: (1) a value of δt+1 is reached which would violate the bounding condition −wi≦δ≦wj; (2) the algorithm has converged, for example |δt+1−δt|<Threshold where Threshold is a (typically relatively small) positive threshold value; or (3) the number of iterations reaches a predetermined maximum. For the termination condition (1), various remedial operations can be used to select the final value of δ so as to satisfy the bounding condition −wi≦δ≦wj. For example, in one approach if δ<−wi then the final value is set at δ=−wi (which has the effect of zeroing the adjustable parameter wi), while if δ>wj then the final value is set at δ=wj (which has the effect of zeroing the adjustable parameter wj).
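The three termination conditions can be sketched as follows. This is an illustrative implementation, not the disclosed one: it assumes the mixture form Θ(δ)=Σn log(sn+δ·dn) with sn=Σk wk pk,n and dn=pi,n−pj,n, assumes strictly positive pk,n, and uses hypothetical names and default tolerances.

```python
def newton_delta(w, p, i, j, threshold=1e-10, max_iter=50):
    """Newton-Raphson search for the adjustment delta of the pair (i, j)."""
    n_objects = len(p[0])
    s = [sum(w[k] * p[k][n] for k in range(len(w))) for n in range(n_objects)]
    d = [p[i][n] - p[j][n] for n in range(n_objects)]
    lower, upper = -w[i], w[j]
    delta = 0.0                                   # start from "no change"
    for _ in range(max_iter):                     # condition (3): iteration cap
        ratios = [d_n / (s_n + delta * d_n) for s_n, d_n in zip(s, d)]
        grad = sum(ratios)                        # first derivative in delta
        hess = -sum(r * r for r in ratios)        # second derivative, never positive
        if hess == 0.0:                           # p_i equals p_j: objective is flat
            break
        new_delta = delta - grad / hess
        if new_delta < lower:                     # condition (1): clip, zeroing w[i]
            return lower
        if new_delta > upper:                     # condition (1): clip, zeroing w[j]
            return upper
        if abs(new_delta - delta) < threshold:    # condition (2): converged
            return new_delta
        delta = new_delta
    return delta

# Illustrative values only.
interior = newton_delta([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]], 0, 1)
clipped = newton_delta([0.5, 0.5], [[0.9, 0.9], [0.1, 0.1]], 0, 1)
```

In the first call the maximizer lies inside the bounds (analytically δ=−1/14 for these values); in the second the objective increases over the whole interval, so the first Newton step overshoots and δ is clipped to wj.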
The optimizer 24 outputs the final value for the adjustment value δ. A parameters updater 26 updates the selected parameters wi,wj by replacing the current value of wi by the new value wi+δ and replacing the current value of wj by the new value wj−δ. A repeating operator 28 causes the selection 20, the δ optimization 24, and the parameter pair updating 26 to repeat until a selected stopping condition is met. One suitable stopping criterion is based on fractional change of the value of the log-likelihood function, for example:
where here index t denotes iterations caused by the repeating operation 28. A more complex stopping criterion is based on the recognition that at the precise global maximum of the concave log-likelihood function
for all possible parameter pairs. Referring to Equation (10), it can be seen that this condition is met if and only if:
Denoting
a suitable convergence criterion is maxi,j|βi−βj|<threshold.
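The select, optimize, update, repeat cycle with the β-based stopping test can be sketched end to end. Several choices here are assumptions made to keep the example short: βk is taken to be Σn pk,n/Σl wl pl,n; pairs are visited in a plain sweep over all i<j; and bisection on the derivative is swapped in for Newton-Raphson as the one-dimensional maximizer. All names are hypothetical.

```python
def mixture(w, p):
    """s[n] = sum over k of w[k] * p[k][n] for every object n."""
    return [sum(w[k] * p[k][n] for k in range(len(w))) for n in range(len(p[0]))]

def best_delta(w, p, i, j, iters=60):
    """Maximize over delta in [-w[i], w[j]] by bisection on the derivative."""
    s = mixture(w, p)
    d = [p[i][n] - p[j][n] for n in range(len(s))]

    def slope(delta):
        return sum(d_n / (s_n + delta * d_n) for s_n, d_n in zip(s, d))

    lower, upper = -w[i], w[j]
    if slope(lower) <= 0.0:        # decreasing on the whole interval
        return lower
    if slope(upper) >= 0.0:        # increasing on the whole interval
        return upper
    for _ in range(iters):         # slope is decreasing, so bisection applies
        mid = 0.5 * (lower + upper)
        if slope(mid) > 0.0:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)

def beta_gap(w, p):
    """max over (i, j) of |beta_i - beta_j|, which equals max(beta) - min(beta)."""
    s = mixture(w, p)
    beta = [sum(p_k[n] / s[n] for n in range(len(s))) for p_k in p]
    return max(beta) - min(beta)

def maximize(w, p, threshold=1e-8, max_sweeps=1000):
    """Repeat pairwise updates until the beta gap falls below `threshold`."""
    w = list(w)
    for _ in range(max_sweeps):
        if beta_gap(w, p) < threshold:
            break
        for i in range(len(w)):
            for j in range(i + 1, len(w)):
                delta = best_delta(w, p, i, j)
                w[i] += delta
                w[j] -= delta
    return w

# Symmetric toy problem (illustrative values): the maximizer is w = [1/3, 1/3, 1/3].
p = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
w_opt = maximize([0.7, 0.2, 0.1], p)
```

When some optimal weights are exactly zero, the gap taken over all K parameters need not vanish, so an implementation would likely restrict the test to parameters still treated as adjustable or fall back on the iteration cap.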
Performance of the concave log-likelihood function maximizer 10 is dependent on the algorithm employed by the parameter pairs selector 20 to select successive parameter pairs. In general, it is desired that the selection “cycle through” the K parameters, or at least those of the K parameters having non-zero values, in an efficient manner so that within a few multiples of K iterations it is ensured that at least all non-zero parameter values have been updated. Moreover, it is desired that each parameter (or at least each non-zero parameter) be occasionally paired with each other parameter (or at least each other non-zero parameter), to ensure that all possible value tradeoffs between the various possible pairs (i,j) are efficiently explored.
In one suitable pairs selection approach, the index i cycles deterministically through all possible values. For example, in successive repetitions caused by the repeater 28, the value of the index i can follow the deterministic sequence i=1,2,3, . . . ,K,1,2,3, . . . ,K, and so on. For each value of the index i, the index j is selected randomly from all available values k=1, . . . ,K, k≠i. This approach is expected to be sub-optimal in that it does not provide the most efficient pairs selection to reach maximization of the concave log-likelihood function in the fewest iterations. However, the selection approach is computationally efficient and has been found to provide good convergence in practice.
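The cyclic-i, random-j schedule can be written as a small generator. This is a sketch with zero-based indices; the function name and the optional `rng` argument are hypothetical conveniences for reproducibility.

```python
import random

def cyclic_random_pairs(n_params, rng=None):
    """Yield (i, j) forever: i cycles 0..K-1, j is uniform over the other K-1 indices."""
    rng = rng or random.Random()
    while True:
        for i in range(n_params):
            j = rng.randrange(n_params - 1)   # draw from K-1 slots...
            if j >= i:
                j += 1                        # ...and skip over i itself
            yield i, j

pairs = cyclic_random_pairs(4, random.Random(0))
first_two_cycles = [next(pairs) for _ in range(8)]
```

Each selection costs a constant amount of work, independent of K and N, which is consistent with the remark that this scheme is computationally efficient.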
In other suitable pair selection approaches, the selection strategy is tailored to enhance the likelihood that a repetition caused by the repeater 28 will produce a relatively large increase in the value of the concave log-likelihood function. One way to do this is to bias the selection toward selecting a pair of adjustable parameters having large values compared with parameters of the plurality of adjustable parameters that are not selected by the selecting. Typically, larger parameter values contribute more to the value of the concave log-likelihood function than smaller values. One suitable selection approach that provides such weighting is as follows. Performing a first order expansion of the adjusted log-likelihood function of Equation (15) with respect to the adjustment value yields:
which can be approximated as:
Denote again
If a pair (i,j) has a high value |βi−βj| then it is likely to yield a relatively large increase in the log-likelihood function. Accordingly, at each iteration the selector 20 suitably selects the pair (i,j) that yields the maximum value for |βi−βj|. This selection approach is likely to provide convergence with fewer iterations; however, the selection approach is relatively computationally complex.
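A greedy selector along these lines might look as follows. It assumes βk=Σn pk,n/Σl wl pl,n, orients the pair so that wi is the parameter to be increased, and restricts j to parameters with positive weight because a zero weight cannot be decreased; that restriction is an assumption of this sketch. Names are hypothetical.

```python
def betas(w, p):
    """beta[k] = sum over n of p[k][n] / s[n], with s[n] the current mixture value."""
    n_objects = len(p[0])
    s = [sum(w[k] * p[k][n] for k in range(len(w))) for n in range(n_objects)]
    return [sum(p_k[n] / s[n] for n in range(n_objects)) for p_k in p]

def select_greedy_pair(w, p):
    """Pick i with the largest beta and j with the smallest beta among w[j] > 0."""
    beta = betas(w, p)
    i = max(range(len(w)), key=lambda k: beta[k])
    donors = [k for k in range(len(w)) if k != i and w[k] > 0.0]
    if not donors:
        return None                # nothing can be moved toward i
    j = min(donors, key=lambda k: beta[k])
    return i, j

# Illustrative values only: the third candidate currently has zero weight.
p = [[0.9, 0.1],
     [0.2, 0.8],
     [0.5, 0.5]]
pair = select_greedy_pair([0.5, 0.5, 0.0], p)
```

Every call recomputes all K values of β over all N objects, which illustrates why this selection rule costs more per iteration than the cyclic scheme.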
For some applications, it is expected that the concave log-likelihood function will be relatively sparse, by which it is meant that many, and perhaps most, of the adjustable parameters wk will be zero. Accordingly, in some embodiments any adjustable parameter whose value becomes zero is no longer treated as adjustable. In other words, in such embodiments once an adjustable parameter goes to zero it is excluded from further selection as a member of the pair of parameters (i,j). Since this can result in erroneous results if the parameter should in fact be non-zero at the global maximum, in some embodiments a parameter having zero value may be kept in the cycling of parameter pair samplings (i,j) for a selected number of iterations or until the concave log-likelihood function appears to be close to convergence, after which time parameters that go to zero are excluded. The phraseology “go to zero” and the like in some embodiments is construed as going below a selected threshold value. In some embodiments, it is contemplated to set a parameter that goes below a selected threshold value identically to zero.
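The exclusion policy can be sketched with two small helpers. The threshold value, the warm-up length, and the renormalization after zeroing are all assumptions chosen for illustration; the names are hypothetical.

```python
def prune_weights(w, threshold=1e-6):
    """Zero every weight below `threshold`, then rescale so the rest sum to one."""
    kept = [0.0 if w_k < threshold else w_k for w_k in w]
    total = sum(kept)
    return [w_k / total for w_k in kept]

def adjustable_indices(w, iteration, warmup_iterations=100):
    """Indices eligible for pair selection at the given iteration.

    During the warm-up every parameter stays eligible, so a weight that hit
    zero early can still recover; afterwards zero weights are excluded.
    """
    if iteration < warmup_iterations:
        return list(range(len(w)))
    return [k for k, w_k in enumerate(w) if w_k > 0.0]

w = prune_weights([0.6, 1e-9, 0.4])
early = adjustable_indices(w, iteration=5)
late = adjustable_indices(w, iteration=500)
```

Shrinking the eligible set reduces the number of candidate pairs from roughly K² to the square of the number of surviving parameters.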
The disclosed concave log-likelihood function maximizer 10 or its substantive equivalents can be used in various applications. Illustrative clustering and density estimation applications are described with reference to
With reference to
where f(xn;mk) is an exponential family distribution on a random variable, the set {mk}, k=1, . . . ,K represents the centroids of the K clusters, and the set {qk}, k=1, . . . ,K represents the K adjustable parameters indicating the mixture weights of the K clusters. The log-likelihood function of Equation (18) serves as the input log-likelihood function 12 that is input to the concave (in this case, log-) likelihood function maximizer 10 to generate the optimized log-likelihood function 14 having or defining the optimized values for the set of mixture weights {qk}, k=1, . . . ,K. These values are used by a clusters assignor 44 to assign one or more clusters to each object of the set of objects 40 (corresponding to the objects {xn}, n=1, . . . ,N in the formulation of Equation (18)). The cluster assignments can be used in various ways. For example, a clusters renderer 46 can plot the objects color-coded by cluster membership or using another type of rendering on the display 6 to enable a human to review the clustering assignments and, optionally, to manually correct any clustering assignments the human user decides are incorrect or non-optimal.
With reference to
where D is the dimensionality of the space of interest (that is, the dimensionality of the points xn), the superscript T represents the transpose operator, and the operator | . . . | is the determinant operator. The component pk,n of Equation (19) can be included in the log-likelihood functions set forth here (for example, Equation (1)) to generate the input log-likelihood function 12 that is maximized by the concave (in this case, log-) likelihood function maximizer 10 to generate the optimized log-likelihood function 14 having or defining the optimized values for the set of weights wk defining the mixture weights for the Gaussian components of Equation (19). An optimized PDF 54 is suitably constructed as a linear combination of the K Gaussian components each given by Equation (19) and each weighted by the corresponding optimized weight wk determined by the likelihood function maximizer 10. The optimized PDF 54 can be used in various ways. For example, a PDF renderer 56 can plot the objects together with the optimized PDF 54 represented by grayscale shading or using another type of rendering on the display 6 to enable a human to visually review how well the optimized PDF matches the set of objects or data points 50.
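A minimal sketch of the Gaussian components and the resulting PDF follows. To stay free of linear algebra dependencies it assumes diagonal covariance matrices, so the determinant is the product of the per-dimension variances; that restriction, the function names, and the toy values are assumptions of this example and not part of Equation (19).

```python
import math

def gaussian_diag(x, mean, variances):
    """Gaussian density with diagonal covariance (assumed simplification)."""
    dim = len(x)
    det = math.prod(variances)                       # determinant of a diagonal matrix
    quad = sum((x_d - m_d) ** 2 / v_d
               for x_d, m_d, v_d in zip(x, mean, variances))
    return math.exp(-0.5 * quad) / math.sqrt((2.0 * math.pi) ** dim * det)

def mixture_pdf(x, weights, means, variances):
    """PDF as a linear combination of K Gaussian components weighted by w[k]."""
    return sum(w_k * gaussian_diag(x, m_k, v_k)
               for w_k, m_k, v_k in zip(weights, means, variances))

# Illustrative two-component, one-dimensional mixture.
weights = [0.5, 0.5]
means = [[0.0], [4.0]]
variances = [[1.0], [1.0]]
density_at_origin = mixture_pdf([0.0], weights, means, variances)
```

Evaluating `gaussian_diag` for every component and data point yields the pk,n table that the maximizer consumes; components whose optimized weight is zero simply drop out of `mixture_pdf`.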
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also, various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art, which are also intended to be encompassed by the following claims.
Number | Date | Country |
---|---|---|
20100088073 A1 | Apr 2010 | US |