Prediction trees are a type of decision tree used in machine learning and data mining applications, among others. A prediction tree is a decision tree in which each node has a real value associated with it, in addition to a branching variable as in a conventional decision tree. Prediction trees may be built or learned using a first set of training data, which is used to construct the decision and prediction values. The tree may then be applied against a second set of validation data, and the results used to fine-tune the tree. Various computer-implemented techniques are known for growing and applying prediction trees to arbitrary data sets.
Conventional techniques for building prediction trees include two phases: a growing phase and a pruning phase. In the growing phase, nodes are added to the tree to match a known set of data, such as a training set. During this phase the tree may be overgrown, often to the point of fitting some noise in the data as well as real trends and patterns in the data. In an extreme case, for example, a tree can be constructed for a set of data in which each data point is associated with an individual leaf, i.e., the tree is fit exactly to the data set so that no two examples or data points result in the same end leaf or path through the tree. In some cases, such an overgrown tree may exactly fit known data, but could be ineffective or useless at predicting outcomes for other examples or data points.
To avoid the problem of overgrowing a tree, a second pruning phase may be employed in which sections of the tree that provide little or no additional predictive power are removed or collapsed. For example, a portion of the tree that fails to distinguish further among most of the examples that lead to that portion of the tree may be removed, thus terminating that portion of the tree at a higher node. Various pruning and validation techniques are known. For example, validation data may be applied to the tree to determine whether the tree provides equivalent or better predictions in the absence of certain nodes. Such nodes may then be pruned from the tree. Generally, the two-step growing and pruning process is computationally expensive.
Various other additions to tree learning are known. Some tree learning and application techniques associate a prediction with internal nodes of prediction trees; such techniques have been used for the estimation and learning of context trees for compression and classification. Measure-based regularization of prediction trees has been used to penalize a Hilbert norm of the gradient of a prediction function ƒ. Some tree growing techniques have made use of self-controlled learning for online learning of self-bounded suffix trees. The learning procedure can be viewed as the task of estimating the parameters of a prediction tree of a fixed structure using the hinge loss for assessing the empirical risk along with an l2-norm variation penalty. In the context of online learning, this setting may lead to distilled analysis that implies sub-linear growth of the suffix tree. However, such approaches may not migrate directly to other settings. Various Bayesian approaches have also been used for tree induction and pruning.
According to an embodiment of the disclosed subject matter, a computer-implemented method of constructing a self-terminating prediction tree may include constructing a piecewise-continuous function representative of a prediction tree that maps an input space to real prediction values, determining a complexity function for the prediction tree based upon the variation norm of the real-valued prediction values, where the complexity function includes a regularizer that indicates when each child of a node should not be grown, and constructing a weighted risk function based upon the piecewise-continuous function. A variable that minimizes a combination of the complexity function and the weighted risk function for a root node may be identified, and a real value for each child node of the root node determined. The combination of the complexity function and the weighted risk function for each child node may be minimized, so as to obtain a real value for each child node of the child node. An input that includes a request for a prediction of a real value may be received from a user, and the tree may be traversed to obtain the requested prediction.
In an embodiment of the disclosed subject matter, a computer-implemented method of constructing a self-terminating prediction tree may include determining a complexity function for the prediction tree, constructing a weighted risk function for the prediction tree, and minimizing a combination of the complexity function and the weighted risk function to obtain a real-valued prediction for a plurality of nodes in the tree, where nodes having a real-valued prediction of zero are not added to the tree.
A system according to an embodiment of the disclosed subject matter may include a processor configured to construct a piecewise-continuous function representative of a prediction tree, where the function maps an input space to real prediction values, determine a complexity function for the prediction tree based upon the variation norm of the real-valued prediction values, which includes a regularizer that indicates when a node should not be grown, and construct a weighted risk function based upon the piecewise-continuous function. The processor may determine a variable that minimizes a combination of the complexity function and the weighted risk function for the root node, determine a real value for each child node of the root node, and, for each child node of the root node having a non-zero real value, minimize the combination of the complexity function and the weighted risk function for the child node to obtain a real value for each child node of the child node. The system also may include an input configured to receive a request for a prediction of a real value based upon the prediction tree from a user, and an output configured to provide a prediction obtained by traversing the tree based upon the request.
A system according to an embodiment of the disclosed subject matter may include a processor configured to determine a complexity function for a prediction tree, construct a weighted risk function for the prediction tree, and minimize a combination of the complexity function and the weighted risk function to obtain a real-valued prediction for a plurality of nodes in the tree. Nodes in which the optimization method yields no change in the real-valued prediction relative to the parent need not be added to the tree.
In embodiments of the disclosed subject matter, methods and systems as disclosed above may be implemented on or in conjunction with a computer-readable medium that causes a processor to perform the disclosed methods and/or to implement the disclosed systems.
Additional features, advantages, and embodiments of the disclosed subject matter may be set forth or apparent from consideration of the following detailed description, drawings, and claims. Moreover, it is to be understood that both the foregoing summary and the following detailed description are exemplary and are intended to provide further explanation without limiting the scope of the claims.
The accompanying drawings, which are included to provide a further understanding of the disclosed subject matter, are incorporated in and constitute a part of this specification. The drawings also illustrate embodiments of the disclosed subject matter and together with the detailed description serve to explain the principles of embodiments of the disclosed subject matter. No attempt is made to show structural details in more detail than may be necessary for a fundamental understanding of the disclosed subject matter and various ways in which it may be practiced.
It has been found that decision/prediction trees may be more efficiently created, without requiring a separate pruning phase, by using self-terminating prediction trees (SPTs) as disclosed herein. Self-terminating prediction trees are a generalization of decision trees in which each node is associated with a real-valued prediction. Instead of having a separate pruning phase, a self-terminating tree may be constructed by applying various limits during tree growth that prevent nodes that add little or no additional decision power from being grown within the tree. For example, a parent node that would only have a single child node that provides little or no additional information relative to the parent's real-valued prediction may not be grown.
In general, any tree or tree structure that could be created using a conventional growing/pruning technique also may be created using embodiments of the disclosed subject matter. However, whereas growing/pruning techniques normally expand either all children or no children of a node in the tree, embodiments of the disclosed subject matter allow for development of the same or equivalent tree structures directly during tree growth.
According to an embodiment of the disclosed subject matter, an SPT can be viewed as a piecewise-constant function from an input space into a set of real values. Therefore, the children of a node in an SPT split the portion of the input feature space that is defined by the parent node into disjoint partitions, where each of the partitions is associated with a different prediction value. The complexity of the tree may be measured by the variation norm of the piecewise-constant function it induces.
SPTs may be applied to obtain prediction values for base inputs, such as prediction request and/or initial data supplied by a user of a system configured to generate and/or use the SPT. A base prediction for an input instance is formed by summing the individual predictions at the nodes traversed from the root node to a leaf by applying a sequence of branching predicates. The final predicted value may be obtained by applying a transfer function to the base prediction. For example, in the context of a probabilistic classification, a suitable transfer function may be the inverse logit function 1/(1+e−x). As another example, for a least squares regression the identity may be used as a suitable transfer function.
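The traversal and transfer steps described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Node` layout, the predicate signature, and the dictionary-of-children representation of null children are all assumptions made for clarity.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Node:
    alpha: float                                  # real-valued prediction at this node
    predicate: Optional[Callable] = None          # maps an input x to a branch key; None at a leaf
    children: dict = field(default_factory=dict)  # branch key -> Node; a missing key is a null child

def base_prediction(root: Node, x) -> float:
    """Sum the alpha values along the path from the root to the leaf reached by x."""
    total, node = 0.0, root
    while node is not None:
        total += node.alpha
        if node.predicate is None:
            break
        node = node.children.get(node.predicate(x))  # a null child simply ends the path
    return total

def predict_proba(root: Node, x) -> float:
    """Apply the inverse-logit transfer 1/(1 + e^-x) for probabilistic classification."""
    return 1.0 / (1.0 + math.exp(-base_prediction(root, x)))
```

For a least-squares regression, `base_prediction` would be returned directly, since the identity serves as the transfer function.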
According to an embodiment, the logical problem of learning the prediction tree, such as by a computerized process, may be cast as a penalized empirical risk minimization task, based upon the use of prediction values and functional tree complexity described above. For instance, for prediction trees with the inverse logit transfer, a natural choice for the risk is the log-likelihood of the examples. Variation penalties based on l1 and l∞ norms may be used. It has been found that these norms may promote sparse solutions that, in the context of SPTs, correspond to self-termination of the tree-growing phase, meaning that no separate pruning phase is required. These norms also may facilitate parameter estimation of the prediction values.
Embodiments of the presently disclosed subject matter may be “backward compatible” with existing tree learning procedures. That is, other tree learning procedures may be used, and caused to self-terminate using the techniques disclosed herein. Efficient tree growing algorithms may be derived for a variety of loss functions, including some non-convex losses such as the difference of hinge functions, which may provide a tighter bound to the 0-1 loss.
For example, upon omitting the variation penalty, techniques disclosed herein may provide other growing criteria such as the information gain and the Gini index.
In an embodiment of the disclosed subject matter, an optimization method employing a dual representation of the (primal) penalized risk may be used, which may enable a unified treatment of different variational norms through their dual norms. A combined primal-dual procedure also may provide an algorithmic skeleton independent of the empirical loss.
Embodiments of the presently disclosed subject matter may diverge from conventional tree construction methods, which require two uncoupled phases of growing and then pruning the tree. The fact that the growing/induction phase is divorced from the pruning phase poses aesthetic and computational challenges, since two-phase tree induction methods often grow trees beyond the size necessary and, in some cases, over-grow the tree and result in fitting to noise in addition to data trends.
As disclosed above, a prediction tree is a generalization of a decision tree in which each node s is assigned a predicate πs that is used for branching, as well as a real value αs.
The use of techniques disclosed herein in binary predictions will now be described.
For any node s in the prediction tree, the path Ps(x) is defined as the path of nodes from the root node to the node s when evaluating x. The sum of real values bs along the path is given by bs(x)=ΣiεPs(x)αi.
For a given prediction tree T, the norm variation complexity VP(T) is defined as ΣsεTλ(s)∥αC(s)∥p, where C(s) is the set of children of the node s and λ(s) is a penalty for node s, e.g., the depth of node s. By convention, the real value α is set to 0 for null children. Thus, for p=1 and p=∞:
V1(T)=Σs≠root{tilde over (λ)}(s)|αs|
and
V∞(T)=ΣsεTλ(s)maxs′εC(s)|αs′|
where {tilde over (λ)}(s) is the penalty for the parent of node s. The penalties λ(s) and {tilde over (λ)}(s) may be used to encourage small decision trees. In general, the regularization constant λ provides a control for the degree of sparsity of the prediction tree.
The use of the l∞ regularizer above may provide a sparse solution in which all of the children C(s) of a node s are zero. If the optimal solution is such that at least some αs′ for s′εC(s) is non-zero, then the rest of the children can be non-zero as well without incurring further penalty.
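A direct reading of the complexity definition can be sketched as below. The `Node` class and the choice λ(s) = depth of s are illustrative assumptions; null children are represented simply by their absence, so they contribute α = 0 as the convention requires.

```python
class Node:
    def __init__(self, alpha=0.0, children=None):
        self.alpha = alpha
        self.children = children or {}  # branch key -> Node; null children are absent

def variation_complexity(node, p, depth=1, penalty=lambda d: d):
    """V_p(T) = sum over nodes s of lambda(s) * ||alpha_{C(s)}||_p,
    with lambda(s) taken here to be the depth of s (an assumed choice)."""
    if not node.children:
        return 0.0
    alphas = [abs(c.alpha) for c in node.children.values()]
    norm = max(alphas) if p == float('inf') else sum(a ** p for a in alphas) ** (1.0 / p)
    return penalty(depth) * norm + sum(
        variation_complexity(c, p, depth + 1, penalty)
        for c in node.children.values())
```

For a root with two children carrying α = 3 and α = −4, the l1 norm of the children is 7 while the l∞ norm is 4, illustrating why the l∞ penalty does not charge extra once one child is non-zero.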
As disclosed above, in an embodiment of the disclosed subject matter the tree learning process can be performed as a penalized empirical risk minimization task. To do so, the tree is modeled as a piecewise-continuous function and a risk function is applied. For example, a function ƒT may be defined for a prediction tree T. As a specific example, for an input x, ƒT(x) may be the sum of the α values along the path from the root of the tree T to the leaf reached by x. An empirical risk function {circumflex over (R)}(L, ƒ, w) may be defined for the function ƒ with loss L weighted by w≧0. Given examples xi and labels yi, the empirical risk is {circumflex over (R)}(L, ƒ, w)=Σiwi L(ƒ(xi), yi).
Then the goal is to minimize the penalized weighted empirical risk (Equation 1):
{circumflex over (R)}(L, ƒT, w)+VP(T)
Equation 1 incorporates sparsity-promoting regularization and, therefore, the learning technique encourages small trees that naturally terminate growth.
This technique greedily builds a multivariate prediction tree, but does not require a separate pruning phase as with conventional trees. Further, any "pruning" occurs at the finer granularity of edges, rather than at nodes. Because each node has an associated prediction, the value may be applied upon reaching a null child. The variable that minimizes Equation 1 by itself may be placed in the root, and then the same procedure may be recursively applied to all added nodes.
The optimization procedure used to select the variable to place at the node simultaneously determines the value αj for each of the branches defined by the selected variable. For each branch for which αs is non-zero, the process is recursively applied. That is, embodiments of the presently disclosed subject matter may learn a prediction tree by first determining a variable that minimizes a combination of the complexity function and the weighted risk function at a root node, which also provides a real value for each child node of the root node. Similarly, these techniques may then determine a variable that minimizes a combination of the complexity function and the weighted risk function for each child node having a non-zero real value, which provides a real value for each child node of that child node. The process may be recursively applied for each child level having at least one node with a non-zero real value.
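The recursive growth just described can be sketched as follows. Here `solve_node` is a hypothetical subroutine standing in for the per-node minimization of Equation 1: it takes the examples reaching a node and one candidate split, and returns the objective value together with the optimal α for each branch. Everything else about the structure is an assumption for illustration.

```python
class Node:
    def __init__(self, alpha=0.0):
        self.alpha, self.split, self.children = alpha, None, {}

def grow(node, examples, splits, solve_node, depth=0, max_depth=5):
    """Greedy growth with self-termination: a branch whose optimal alpha is
    zero is simply never added, so no separate pruning phase is needed."""
    if depth >= max_depth or not examples or not splits:
        return node
    # choose the split whose optimal (objective, {branch: alpha}) minimizes Equation 1
    split = min(splits, key=lambda s: solve_node(examples, s)[0])
    _, alphas = solve_node(examples, split)
    node.split = split
    for branch, alpha in alphas.items():
        if alpha == 0.0:            # zero prediction: this edge self-terminates
            continue
        child = Node(alpha=alpha)
        node.children[branch] = child
        grow(child, [x for x in examples if split(x) == branch],
             splits, solve_node, depth + 1, max_depth)
    return node
```

The key design point is that termination is a by-product of the per-node optimization: whenever the regularized solution zeroes a branch's α, that subtree never exists, rather than being grown and later removed.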
Notably, the regularizer used in the objective determines when to stop growing the tree, i.e., when the tree will self-terminate. Furthermore, the regularization constant λ provides a control for the degree of sparsity for the prediction tree, as shown and described with respect to
In general, as disclosed above, embodiments of the presently disclosed subject matter may be considered as including several components: associating each node of a prediction tree with a confidence value and a real-valued prediction, and learning the tree by minimizing a penalized empirical risk function. The risk function may be applied, for example, to a piecewise-continuous model of the tree. The complexity measure of the tree may be defined as the variation of the real-valued predictions. As disclosed in further detail below, various loss functions may be used with the penalized empirical risk tree learning technique.
In embodiments of the presently disclosed subject matter, node expansion may be performed through a variety of techniques. Techniques for learning sparse real α values for a node's children according to an embodiment of the disclosed subject matter will now be described. The predicate π to use within a node s is chosen by greedily selecting the predicate minimizing the penalized loss (1). More specifically, the loss obtained when s is associated with a k-ary predicate π may be derived, which in turn may create k children with values α1, . . . , αk. In the following description, wij is set equal to wi when example i follows branch j, and to 0 otherwise, and b=bs(x) where x is the example being considered.
In embodiments of the presently disclosed subject matter, techniques for addressing classification problems having labels in {−1, 1} for a variety of loss functions may be used. For the following description, the values μj and νj are defined as the total weights of the positive and negative examples reaching branch j:
μj=Σi:yi=1wij
νj=Σi:yi=−1wij
For the logistic loss case, L(ƒ(x), y)=log(1+e−yƒ(x)). To expand a node s into k children based upon splitting a particular feature, the following penalized loss is minimized over αεℝk:
In terms of ν and μ, this becomes (Equation 2):
Σj[μj log(1+e−(αj+b))+νj log(1+eαj+b)]+λ∥α∥p
It can be shown that this generalizes a conventional greedy tree building using information gain by first determining the dual of Equation 2. H is used to denote the binary entropy and 1/p+1/q is set equal to 1, so that lq is dual to lp. The dual problem to Equation 2 is then given by (Lemma 1)
Given the optimal dual variable γ, the optimal α is αj=log [(μj−γj)/(νj+γj)]−b. Notably, when γj=0 for all j, this objective reduces to a standard information gain.
A general-purpose solution for the dual case may be obtained as described herein. In some embodiments of the presently disclosed subject matter, it may be useful to use a primal-based algorithm for an l1 regularizer. To do so, the sub-gradient of Equation 2 is determined with respect to αj and set to 0. So
where
Thus, rj and αj are:
The closed-form solution for αj requires knowledge of sj; however, the sign of αj may be determined from known quantities:
when αj>0 and setting sj=1, (μj−λ)/(νj+λ)>eb;
when αj<0 and setting sj=−1, (μj+λ)/(νj−λ)<eb;
when αj=0, −1≦sj≦1;
when sj≧−1, eb≦(μj+λ)/(νj−λ); and
when sj≦1, (μj−λ)/(νj+λ)≦eb.
Thus, sj and αj can be determined based upon only the known quantities μ, ν, λ, and b. A more complex method may be applied to all convex losses and for both l1 and l∞ regularizers, as described in further detail herein.
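The sign conditions above give a direct closed form for αj. The sketch below assumes μj and νj are the weighted positive and negative counts for branch j as defined earlier; when neither condition holds, αj = 0 and the edge self-terminates.

```python
import math

def logistic_alpha(mu, nu, lam, b):
    """Closed-form alpha_j for the logistic loss with an l1 penalty,
    following the sign conditions on s_j."""
    eb = math.exp(b)
    if mu > lam and (mu - lam) / (nu + lam) > eb:   # alpha_j > 0, s_j = 1
        return math.log((mu - lam) / (nu + lam)) - b
    if nu > lam and (mu + lam) / (nu - lam) < eb:   # alpha_j < 0, s_j = -1
        return math.log((mu + lam) / (nu - lam)) - b
    return 0.0                                      # -1 <= s_j <= 1: self-terminate
```

With λ = 0 and b = 0 this reduces to the familiar log-odds log(μj/νj), consistent with the unpenalized optimum noted above.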
Another loss of interest in an embodiment of the disclosed subject matter may be the hinge loss, for which L(ƒ(x), y)=max {0, 1−yƒ(x)}. The resulting optimization problem is then:
Σj[μj max {0, 1−(αj+b)}+νj max {0, 1+(αj+b)}]+λ∥α∥p
For p=1, the loss is piecewise-linear, so the objective may be determined at the three inflection points α=0, α=1−b, and α=−1−b. The objective values may then be compared to find the minimum. The dual approach described herein may also be used, such as when p=∞.
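The p = 1 case can be sketched by direct enumeration of the inflection points. The per-branch objective below, built from the μj/νj weighted counts, is an assumed form consistent with those definitions rather than the patent's exact expression.

```python
def hinge_alpha(mu, nu, lam, b):
    """Minimize the piecewise-linear hinge objective for one branch by
    evaluating it only at the inflection points alpha in {0, 1-b, -1-b}."""
    def objective(a):
        return (mu * max(0.0, 1.0 - (a + b))     # hinge on positives
                + nu * max(0.0, 1.0 + (a + b))   # hinge on negatives
                + lam * abs(a))                  # l1 penalty on alpha
    return min((0.0, 1.0 - b, -1.0 - b), key=objective)
```

Because a piecewise-linear function attains its minimum at a breakpoint, comparing three candidate values suffices; no iterative solver is needed.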
In an embodiment of the disclosed subject matter, another loss of interest may be the difference of hinge loss. For this loss function, L(ƒ(x), y)=dh(ƒ(x) y), where dh is the difference of a hinge function at 0 and a hinge function at −1, and is defined as dh(z)=max {0, 1−z}−max {0, −z}. The associated optimization problem is
For this loss, a solution may be obtained using the primal for both p=1 and p=∞. When p=1, the loss is piecewise-linear as with the hinge loss, allowing for the objective to be determined at the inflection points α=0, α=1−b, α=−1−b, and α=−b. Similarly, when p=∞, the loss is piecewise-linear with inflection points at αj=0, for αjε{−r, r}, where r=min(|1−b|, |1+b|) and αjε{1−b,−(1+b)} for all j.
In an embodiment of the disclosed subject matter, an exponential loss function may be used. For exponential loss, L(ƒ(x),y)=exp(−ƒ(x) y), so the objective function is
For p=1, setting the sub-gradient with respect to αj to zero yields
−μjexp(−(αj+b))+νjexp(αj+b)+λsign(αj)=0.
The equation is a second-order polynomial in eα and the solution is the root of the equation. Just as the logistic loss generalizes a standard information gain measure for tree growing, the exponential loss generalizes the Gini index.
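The quadratic in eα can be solved directly. The sketch below substitutes t = e^(αj+b), turning the stationarity equation into νt² + λst − μ = 0, and keeps the root whose implied sign of α matches the assumed sign s; this sign-matching step is an assumption about how the root is selected.

```python
import math

def exp_alpha(mu, nu, lam, b):
    """Solve -mu*e^{-(a+b)} + nu*e^{a+b} + lam*sign(a) = 0 via the
    substitution t = e^{a+b}, a quadratic nu*t^2 + lam*s*t - mu = 0 in t."""
    disc = math.sqrt(lam * lam + 4.0 * mu * nu)
    for s in (1.0, -1.0):
        t = (-lam * s + disc) / (2.0 * nu)  # the positive root of the quadratic
        a = math.log(t) - b
        if a * s > 0:                       # sign(a) must match the assumed s
            return a
    return 0.0                              # no consistent sign: alpha_j = 0
```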
In an embodiment of the disclosed subject matter, a squared loss may be applied for regression problems (where yεℝ). The squared loss is L(ƒ(x), y)=½(ƒ(x)−y)2. The technique then attempts to find α that minimizes
where C is a constant independent of α. Defining
μj=Σiwij
and
νj=Σiwijyi
gives the equivalent objective
Σj[½μj(αj+b)2−νj(αj+b)]+λ∥α∥p+C.
The saddle point for αj is defined by μj(αj+b)−νj+λsign(αj)=0. So when αj>0, at the saddle point, αj=(νj−λ)/μj−b. This occurs if and only if (νj−λ)/μj−b>0, or equivalently when
νj/μj−b>λ/μj.
Similarly, when αj<0, αj=(νj+λ)/μj−b, if and only if (νj+λ)/μj−b<0, or equivalently when
νj/μj−b<−λ/μj.
If neither condition holds, then αj=0.
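The three squared-loss cases above amount to a soft-threshold rule, sketched here with μj as the total weight reaching branch j and νj as the weighted target sum, per the definitions just given.

```python
def squared_alpha(mu, nu, lam, b):
    """Soft-threshold solution of mu*(a+b) - nu + lam*sign(a) = 0."""
    a_pos = (nu - lam) / mu - b
    if a_pos > 0:
        return a_pos      # the nu_j/mu_j - b > lam/mu_j case
    a_neg = (nu + lam) / mu - b
    if a_neg < 0:
        return a_neg      # the nu_j/mu_j - b < -lam/mu_j case
    return 0.0            # neither condition holds: the edge self-terminates
```

Note that with λ = 0 the rule collapses to αj = νj/μj − b, the weighted mean residual for the branch, which is the unregularized least-squares update.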
As previously indicated, in an embodiment of the disclosed subject matter a generalized solution to solving a dual optimization method for both classification and regression may be used.
As disclosed above, a variety of loss functions may be used, including convex loss functions.
In an embodiment of the presently disclosed subject matter, the standard University of California-Irvine (UCI) data sets as commonly used in the field were used to grow and test a self-terminating tree. The results obtained with this embodiment demonstrate that the self-terminating tree techniques disclosed herein provide results competitive with a sophisticated CART implementation that uses validation data in a post-pruning process. In contrast, embodiments of the presently disclosed subject matter allow for trees to self-terminate during the growing phase, with validation data only needed to select the value of λ. To obtain a standard deviation, the standard UCI training data was used, with ⅙ of the training data used as test data. The remaining ⅚ was provided as training data (with a fraction set aside as designated by the algorithm for cross validation). The classification results were averaged over 200 repetitions of this process, and the results for regression averaged over 50 repetitions. The results are shown below:
This shows that classification errors obtained by embodiments of the presently disclosed subject matter are comparable to known CART results. As an alternate view,
One reason to consider the hinge loss and/or the difference of hinge loss is that these both better approximate the 0-1 loss, and as such should be more robust to classification errors.
Since SPTs use empirical risk minimization with respect to a real-valued prediction associated with each node in the tree, it would be expected that, as with minimizing the log loss, the techniques disclosed herein will perform well for regression as compared with CART. The following table shows a comparison between CART and SPTs according to embodiments of the presently disclosed subject matter using the squared loss with an l1 regularizer. As expected, SPTs according to embodiments of the presently disclosed subject matter may significantly outperform CART on these data sets.
Embodiments of the presently disclosed subject matter also may be extended and generalized to multiclass problems. For example, some embodiments of the presently disclosed subject matter may provide techniques to solve multiclass problems using an l1 regularizer at the node level. Using this restriction, an estimation procedure for each child of a node may be individually performed. A derivation of an example multiclass technique and solution according to an embodiment of the presently disclosed subject matter is disclosed in the appendix provided herewith.
Embodiments of the presently disclosed subject matter may be used to construct and use self-terminating trees in a variety of contexts. For example, self-terminating trees may be used to automatically classify or rank various items within a computer system. Specific examples include assigning a likelihood that a file is corrupt, identifying a desired file or component, ranking cost or value of a set of items, attributes, or conditions, assigning a probability that a user's provided identity is correct, determining a likelihood that a security measure has been breached, and the like, as well as various other ranking and/or classification applications. In these configurations, the real value at each node may provide, for example, an indication of whether a user is likely to perform a specific action, if an analysis of the user's history or attributes leads to that node of the tree. Each node may indicate an attribute the user may have, the value of which for the particular user indicates which branch or path through the tree should be followed. Thus, by applying a tree to a particular user, file, configuration, message, etc., the tree may provide a prediction that the user's data is inaccurate, that the file is corrupt, or the like.
The tree structure shown in
In a growth/pruning-type technique, these nodes may then be pruned based upon the performance of the full tree when applied to validation data. For example, the validation data may show that the additional branches 1620-1660 provide little or no additional predictive power, or that a tree without one or more of these branches performs better than the fully-grown tree that includes these branches. Thus, the branches 1620-1660 may be removed from the tree, resulting in a tree similar or identical to that obtained by an SPT technique as disclosed herein. The additional growth of branches 1620-1660 that are later pruned causes computational inefficiencies, especially for larger trees and data sets. Thus, embodiments of the disclosed subject matter may provide improved processing time relative to growth/pruning-type techniques for tree growth.
Embodiments of the presently disclosed subject matter may be implemented in and used with a variety of device and network architectures.
The bus 212 allows data communication between the central processor 214 and the system memory 217, which may include read-only memory (ROM) or flash memory (neither shown), and random access memory (RAM) (not shown), as previously noted. The RAM is generally the main memory into which the operating system and application programs are loaded. The ROM or flash memory can contain, among other code, the Basic Input-Output system (BIOS) which controls basic hardware operation such as the interaction with peripheral components. Applications resident with the computer system 200 are generally stored on and accessed via a computer readable medium, such as a hard disk drive (e.g., fixed storage 224), an optical drive, floppy disk, or other storage medium 237.
The fixed storage 224 may be integral with the computer system 200 or may be separate and accessed through other interface systems. The network interface 208 may provide a direct connection to a remote server via a telephone link, to the Internet via an internet service provider (ISP), or a direct connection to a remote server via a direct network link to the Internet via a POP (point of presence) or other technique. The network interface 208 may provide such connection using wireless techniques, including digital cellular telephone connection, Cellular Digital Packet Data (CDPD) connection, digital satellite data connection or the like.
Many other devices or subsystems (not shown) may be connected in a similar manner (e.g., document scanners, digital cameras and so on). Conversely, all of the devices shown in
Various embodiments of the presently disclosed subject matter may include or be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. Embodiments also may be embodied in the form of a computer program product having computer program code containing instructions embodied in non-transitory and/or tangible media, such as floppy diskettes, CD-ROMs, hard drives, USB (universal serial bus) drives, or any other machine readable storage medium, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. Embodiments also may be embodied in the form of computer program code, for example, whether stored in a storage medium, loaded into and/or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing embodiments of the disclosed subject matter. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits. In some configurations, a set of computer-readable instructions stored on a computer-readable storage medium may be implemented by a general-purpose processor, which may transform the general-purpose processor or a device containing the general-purpose processor into a special-purpose device configured to implement or carry out the instructions. Embodiments may be implemented using hardware that may include a processor, such as a general purpose microprocessor and/or an Application Specific Integrated Circuit (ASIC) that embodies all or part of the method in accordance with embodiments of the disclosed subject matter in hardware and/or firmware. 
The processor may be coupled to memory, such as RAM, ROM, flash memory, a hard disk or any other device capable of storing electronic information. The memory may store instructions adapted to be executed by the processor to perform the method in accordance with an embodiment of the disclosed subject matter.
The foregoing description and following appendices, for purpose of explanation, have been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit embodiments of the disclosed subject matter to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to explain the principles of embodiments of the disclosed subject matter and their practical applications, to thereby enable others skilled in the art to utilize those embodiments as well as various embodiments with various modifications as may be suited to the particular use contemplated.
A General Purpose Dual Optimization Method
In this section, we give a single unified algorithm to solve the dual problem for both classification and regression problems. This unified algorithm relies on an ordering lemma that allows us to determine which of the dual variables are positive, negative and zero.
We first present the ordering lemma for the classification setting. First observe that introducing Lagrange multiplier θ≧0 for the constraint ∥γ∥1≦λ for the dual problem gives us Lagrangian
Lemma 1. Assume that μj>0 and νj>0. Define κj:=log(μj/νj)−b. Then
κj>θ iff γj>0, κj<−θ iff γj<0, and −θ≦κj≦θ iff γj=0.
Proof. Let sj∈∂|γj|. Then the subgradient condition for optimality of the dual (1) is
Let κj>θ and assume that γj≤0. Then sj∈[−1,0], and
contradicting the subgradient conditions for optimality. The case for κj<−θ is similar, and when κj∈[−θ,θ], setting γj=0 gives sj∈[−1,1] and
which satisfies the subgradient conditions for optimality.
For the converse, assume that γj>0 is optimal, that is, it satisfies Eq. (2). Then
and the case for γj<0 is similar. If γj=0, then by Eq. (2) there is some sj∈[−1,1] for which κj+θsj=0, or κj∈[−θ,θ]. □
Similarly to the derivation for the logistic loss, we have the following ordering lemma for the regression setting.
Lemma 2. The dual problem is
Further, given the optimal dual variable γ, the optimal α is
Again, a sorting algorithm using the unconstrained dual solution γ̂ gives an efficient method for solving the constrained dual problem. When p=1 (so q=∞), the solution is simply truncation. When p=∞, so that q=1, we consider the Lagrangian for the negative dual, adding a multiplier θ≥0 for the constraint that ∥γ∥1≤λ. We have
The structure of the solution is given by the following lemma.
Lemma 3. Let
Then
κj>θ iff γj>0, κj<−θ iff γj<0, and −θ≤κj≤θ iff γj=0.
Proof. Let sj∈∂|γj|. Then the subgradient condition for optimality of the dual (3) is
Let κj>θ and assume for the sake of contradiction that γj≤0. Then sj∈[−1,0], and
a contradiction to the fact that
Conversely, Eq. (4) implies that when γj>0,
The proof for the case that κj<−θ is similar. When κj∈[−θ,θ], there is some sj∈[−1,1] such that
so that Eq. (4) is satisfied. Conversely, if γj=0 is optimal, then Eq. (4) implies
We now derive our dual algorithm. We start with the simpler setting, in which the dual is accompanied by l∞ constraints.
Solving the Dual with l∞ Constraints
When the primal problem uses l1-regularization, the dual problem has an l∞ constraint. Let γ̂ denote the unconstrained dual solution for either the regression or classification problem. Both objectives are separable, and the solutions are (see Eq. (2) and Eq. (4))
Thus, with the l∞ constraint added, the solution γ*j=max{min{γ̂j, λ}, −λ} is immediate.
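As a minimal sketch of this componentwise truncation (the helper name is ours; the text gives only the formula):

```python
def clip_to_linf_ball(gamma_hat, lam):
    """Project the unconstrained dual solution onto the l-infinity ball of
    radius lam by clipping each coordinate to the interval [-lam, lam]."""
    return [max(min(g, lam), -lam) for g in gamma_hat]
```

Because the objective is separable, this clipping is exact: each coordinate can be projected independently.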
Solving the Dual with l1 Constraints
When p=∞ and q=1, the situation is slightly more complicated, as we now detail; both problems nevertheless have very similar structure. If ∥γ̂∥1≤λ, then the KKT conditions for optimality imply that α=0 and no further work is needed. We thus focus on the case ∥γ̂∥1>λ.
Lemmas 1 and 3 suggest an efficient algorithm that iteratively considers candidate θ values. Had we known the optimal θ*, computing the optimal γj would be easy using Eq. (2) or Eq. (4). Thus, given θ, let γ(θ) denote the optimal γ. We define index sets I−, I0, and I+, containing the indices for which γj<0, γj=0, and γj>0, respectively. By Lemmas 1 and 3, it is clear that I−={j:κj<−θ}, I+={j:κj>θ}, and I0={j:κj∈[−θ,θ]}, allowing κj=±∞.
Our algorithm essentially initializes θ at infinity, places all indices for which |κj|<∞ into I0, then shrinks θ until the index sets change. We call such change values knots, and can compute the optimal γ(θ) given θ using Lemma 1 and Eq. (2) or Lemma 3 and Eq. (4), depending on our setting. The algorithm terminates when ∥γ(θ)∥1=λ. Evidently, the only values of θ we need consider are the κj. Let κ(1) denote the largest knot value, κ(2) the second largest, and so on (we take κ(0)=∞), and note that setting θ=κ(i) induces a partition of γ into I+, I0, and I−; for θ∈(κ(i), κ(i-1)), the index sets I are constant. There must be some i and a setting of θ∈[κ(i), κ(i-1)) for which ∥γ(θ)∥1=λ, since our problems must satisfy the KKT conditions for optimality. As noted earlier, if we knew the optimal θ, we could immediately reconstruct γ(θ) and α. On the other hand, if we have the correct partition of γ into the index sets I, we can reconstruct the optimal θ, as we now discuss.
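The index-set bookkeeping above can be sketched as follows (an illustration with hypothetical helper names; `kappa` holds the κj values, possibly ±∞):

```python
import math

def partition_at_theta(kappa, theta):
    """Split indices per Lemmas 1 and 3: I+ where kappa_j > theta,
    I- where kappa_j < -theta, and I0 where -theta <= kappa_j <= theta."""
    I_plus = [j for j, k in enumerate(kappa) if k > theta]
    I_minus = [j for j, k in enumerate(kappa) if k < -theta]
    I_zero = [j for j, k in enumerate(kappa) if -theta <= k <= theta]
    return I_plus, I_minus, I_zero

def knot_values(kappa):
    """The candidate theta values (knots): the finite magnitudes |kappa_j|,
    scanned from largest to smallest as theta shrinks from infinity."""
    return sorted({abs(k) for k in kappa if math.isfinite(k)}, reverse=True)
```

Indices with κj=±∞ never enter I0, matching the initialization in which indices with νj=0 or μj=0 are placed directly into I+ or I−.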
Given a partition of γ into I+, I0, and I−, consider finding θ. We begin with the logistic loss. Solving for γ in Eq. (2), we have
Let t=e^θ. Then to find the θ such that ∥γ(θ)∥1=λ, assuming the partition of γ into the index sets I is correct, we solve
We can solve the above for t as follows. Let σμ+=Σj∈I+μj, σμ−=Σj∈I−μj, σν+=Σj∈I+νj, and σν−=Σj∈I−νj. Then a bit of algebra yields
−(σν+ + σμ− + λ)t^2 + (e^b(σμ+ − σμ− − λ) + e^{−b}(σν− − σν+ − λ))t + (σμ+ + σν− − λ) = 0. (6)
Clearly Eq. (6) is a quadratic in t, and we can solve for θ=log t (where we take the positive root; if there is none, the algorithm simply continues). For the regression problem, solving for γj in Eq. (4) gives γj(θ)=νj−μj(b+sjθ). Thus, setting the σ values as before for logistic regression, we require that
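The per-knot solve for the logistic case reduces to the quadratic in Eq. (6). A small sketch (hypothetical function name; `a`, `b_coef`, and `c` are the three coefficients of Eq. (6), and if both roots are positive this sketch takes the larger one):

```python
import math

def theta_from_quadratic(a, b_coef, c):
    """Solve a*t**2 + b_coef*t + c = 0 for a positive root t and return
    theta = log(t). Returns None when no positive root exists, in which
    case the algorithm continues to the next knot. Assumes a != 0, which
    holds for Eq. (6) since lam > 0 makes the t^2 coefficient nonzero."""
    disc = b_coef * b_coef - 4.0 * a * c
    if disc < 0:
        return None
    sqrt_disc = math.sqrt(disc)
    roots = [(-b_coef + sqrt_disc) / (2.0 * a),
             (-b_coef - sqrt_disc) / (2.0 * a)]
    positive = [t for t in roots if t > 0]
    return math.log(max(positive)) if positive else None
```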
Solving for θ yields
Thus our algorithm proceeds by iteratively considering knot values κ(i), partitioning γ into I+, I−, and I0, checking whether the θ induced by the partition falls in [κ(i), κ(i-1)), and returning when such a θ satisfying the KKT conditions is found.
The key to the algorithm is to find the optimal partition of γ into I+, I−, and I0. Our algorithm maintains a set I+ of indices j for which we know that γj>0. Initially, these are the indices j for which νj=0. Likewise, we maintain a set I− of indices j for which we know that γj<0, which are initially the indices j for which μj=0.
Our algorithm can be viewed as initializing our candidate for θ to ∞, which corresponds to the partition in which all indices j not initially placed in I+ or I− have γj=0. We then consider the knots in order, moving indices corresponding to positive knots into I+ and indices corresponding to negative knots into I−. Let κi be the knot under consideration. We know that if the partition being considered is correct, then the value of θ for which Σj|γj|=λ must satisfy κi-1>θ≥κi. Since we process the candidates for θ from largest to smallest, it follows that once we reach a partition that produces θ≥κi (equivalently, t≥e^{κi}), we have the optimal partition and its corresponding value of θ.
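For the regression case, the whole knot scan can be sketched end to end. This is an illustrative reconstruction, not the paper's verbatim routine: it assumes κj=νj/μj−b, which is consistent with γj(θ)=νj−μj(b+sjθ) above (γj=μj(κj−θ)>0 iff κj>θ), though the displayed definition of κj did not survive extraction. Within each knot interval the partition is fixed and ∥γ(θ)∥1 is linear in θ, so θ can be solved for directly:

```python
def gamma_of_theta(mu, nu, b, theta):
    """gamma_j(theta) under the partition induced by theta, assuming
    kappa_j = nu_j / mu_j - b (a reconstruction, not verbatim from the text)."""
    gamma = []
    for m, v in zip(mu, nu):
        kappa = v / m - b
        if kappa > theta:        # j in I+, s_j = +1
            gamma.append(v - m * (b + theta))
        elif kappa < -theta:     # j in I-, s_j = -1
            gamma.append(v - m * (b - theta))
        else:                    # j in I0
            gamma.append(0.0)
    return gamma

def solve_theta(mu, nu, b, lam):
    """Scan the knots |kappa_j| from largest to smallest; on each interval
    the l1 norm of gamma(theta) is linear in theta, so solve for it exactly."""
    knots = sorted({abs(v / m - b) for m, v in zip(mu, nu)}, reverse=True)
    knots.append(0.0)
    hi = float("inf")
    for lo in knots:
        # Partition is constant for theta in [lo, hi); accumulate sigma sums.
        sp_mu = sm_mu = sp_nu = sm_nu = 0.0
        for m, v in zip(mu, nu):
            kappa = v / m - b
            if kappa > lo:       # sums over I+
                sp_mu += m
                sp_nu += v
            elif kappa < -lo:    # sums over I-
                sm_mu += m
                sm_nu += v
        denom = sp_mu + sm_mu
        if denom > 0.0:
            # Solve sum_j |gamma_j(theta)| = lam, linear in theta here.
            theta = (sp_nu - sm_nu + b * (sm_mu - sp_mu) - lam) / denom
            if lo <= theta < hi:
                return theta
        hi = lo
    return 0.0
```

For example, with a single coordinate mu=[1.0], nu=[3.0], b=0, and lam=1.0, the scan returns theta=2.0, and gamma_of_theta gives γ=[1.0], whose 1-norm equals λ as required.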
B. Multiclass Problems
We now describe an efficient algorithm to solve the multiclass problem when an l1 regularizer is applied at the node level. Under this restriction, the estimation procedure for each child of a node s can be individually performed.
We focus in this section on the multiclass extension for the log loss. Recall that in the binary classification setting, each node s is associated with a bias value b, accumulated along the path to s, for all examples that reach the node s. In the multiclass setting we instead need to represent the label distribution as a probability vector u rather than a single scalar. Thus, we need to replace the single scalar α associated with each node with a vector α. The distribution induced over the labels takes the form pi∝e^{bi+αi}.
For the remainder of this section we consider a node s with prior u and focus on a single branch from s for which q is the empirical distribution over the labels following that branch. Using the notation introduced earlier, we define qk as the empirical fraction of the examples following that branch whose label is k,
such that ∥γ∥1≤λ and Σiγi=0. To solve the dual form we introduce a Lagrange multiplier θ≥0 for the l1 constraint and δ for the constraint that Σiγi=0, and obtain the following Lagrangian,
Denoting si=sign(γi) and using the subgradient optimality condition with respect to γ yields
where z is the standard normalization (partition function) which ensures that p is a proper distribution. Eq. (1) underscores the relation between γ and p. Specifically, Eq. (1) implies that when γi>0, ui≤pi<qi, and when γi<0, ui≥pi>qi. In words, the solution p lies between q and u, where the lower and upper bounds on each coordinate of p depend on the relation between the corresponding components of q and u. This characterization facilitates the efficient procedure for finding the optimum which we describe in the sequel.
Let I+ be the set of indices for which γi>0, I− be the set of indices for which γi<0, and I0 be the set of indices for which γi=0. Define
and similarly,
Combining Eq. (1) with the constraint that Σiγi=0 (which stems from the requirement Σipi=1) yields
(e^θU+ + e^{−θ}U−)/z = Q+ + Q−. (2)
Similarly, combining Eq. (1) with the constraint Σi|γi|=λ yields
(−e^θU+ + e^{−θ}U−)/z = λ − Q+ + Q−. (3)
Combining the last two equalities gives a closed-form solution for θ and z,
Our derivation is not yet complete. In order to further characterize and find the solution, we need to find the correct partition of the components of γ into the sets I+, I−, and I0.
From Eq. (1) it immediately follows that when γi>0, log(pi/ui)+log z=θ and when γi<0, log(pi/ui)+log z=−θ. Furthermore, by applying the KKT conditions for optimality, the following property holds,
|log(qi/ui)+log z|<θ ⇒ γi=0. (4)
We now combine these properties to obtain an efficient algorithm for finding the optimal partition into I+, I−, and I0 in the optimal solution. First observe that we can sort the components according to the ratios qi/ui. Without loss of generality, and for clarity of our derivation, let us assume that q1/u1≤q2/u2≤ . . . ≤qn/un, where n is the number of different labels. From Eq. (4) we know that there must exist two indices r and s such that 1≤r<s≤n, qr/ur<1, and qs/us>1. In turn, these ratio properties imply that for j≤r, γj<0, γr+1= . . . =γs-1=0, and for j≥s, γj>0. The next key observation is that had we been given the partition, we could have computed the corresponding solution from the equations for z and θ. Finally, from Eq. (4), it is clear that a candidate partition is optimal iff θ>0 and, for all i such that |log(qi/ui)+log z|<θ, the value of γi is zero.
The algorithm to find a partition of the indices into I+, I−, and I0 proceeds as follows. Initially, we place all the indices in I0. In an outer loop, going down and beginning at index n, we add the next element to I+. We also maintain the sums Q± and U±. These sums are used to compute z and θ for each candidate partition in constant time. The sums are initially set to 0 and are updated in constant time as elements are moved from I0 into either I+ or I−. It is easy to verify that for the optimal solution Q+>λ/2. We can thus add elements to I+ until this condition is met. Let us define t+=(Q+−λ/2)/U+. Next, for each candidate set I+, we consider all feasible candidate sets I− by incrementally adding elements, starting with index 1. We also define t−=(Q−+λ/2)/U−. Note that we can rewrite θ=½ log(t+/t−). Thus, if t+/t−≤0, the candidate partition that leads to these values is not feasible. Moreover, since θ>0, t+ must be greater than t−. If either of the two conditions does not hold, the partition is not feasible and we proceed to examine the next partition by adding one more element to I−. If the two conditions hold, we can calculate candidate values for θ and, in turn, z=(e^θU+ + e^{−θ}U−)/(Q+ + Q−). Finally, if the 1-norm of the resulting solution is greater than λ, then we have identified yet another infeasible partition. This condition as well can be verified in constant time since ∥γ∥1=Q+−Q−+(U−e^{−θ}−U+e^θ)/z. Finally, as discussed above, the solution is optimal if and only if |log(qi/ui)+log z|≤θ for i∈I0. This condition can also be checked in constant time by simply examining the largest and smallest ratios qi/ui for i∈I0. The time complexity of this procedure for finding the optimum is O(n²), since we might need to examine all possible pairs (r, s) such that 1≤r<s≤n, qr/ur<1, and qs/us>1. Since the label set is typically not large and we can quickly disqualify candidate partitions, we found that this procedure is very fast in practice.
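The constant-time θ and z computation for a candidate partition can be sketched as follows (hypothetical helper name). Combining Eqs. (2) and (3) gives e^θ/z=t+ and e^{−θ}/z=t−, hence θ=½ log(t+/t−), with z then recovered from Eq. (2); None signals an infeasible partition:

```python
import math

def theta_and_z(q_plus, q_minus, u_plus, u_minus, lam):
    """Candidate theta and z for a fixed partition (I+, I-, I0), where
    t+ = (Q+ - lam/2)/U+ and t- = (Q- + lam/2)/U-. The partition is
    feasible only if t+/t- > 0 and t+ > t- (so that theta > 0)."""
    t_plus = (q_plus - lam / 2.0) / u_plus
    t_minus = (q_minus + lam / 2.0) / u_minus
    if t_minus == 0.0 or t_plus / t_minus <= 0.0 or t_plus <= t_minus:
        return None  # infeasible candidate partition
    theta = 0.5 * math.log(t_plus / t_minus)
    # z from Eq. (2); algebraically this equals 1/sqrt(t+ * t-).
    z = (math.exp(theta) * u_plus + math.exp(-theta) * u_minus) / (q_plus + q_minus)
    return theta, z
```

Since e^θ/z=t+ and e^{−θ}/z=t−, multiplying the two relations also gives z=1/√(t+t−), which provides a quick consistency check on the value returned from Eq. (2).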
Goldman et al., "Self-Pruning Prediction Trees," Feb. 10, 2010, pp. 1-2.