This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2008-193067, filed on Jul. 28, 2008, the entire contents of which are incorporated herein by reference.
The present invention relates to a high-speed technique of rule learning for machine learning.
Among the various known machine learning algorithms, there exist algorithms referred to as “Boosting.” Here, there will be discussed a learning technique based on a technique referred to as “AdaBoost” which is one of the Boosting algorithms. As for the AdaBoost technique, there exist, for example, a paper by Y. Freund and L. Mason (Y. Freund and L. Mason, “The alternating decision tree learning algorithm”, In Proc. of 16th ICML, pages 124-133, 1999), and a paper by R. E. Schapire and Y. Singer (R. E. Schapire and Y. Singer, “Improved boosting using confidence-rated predictions”, Machine Learning, 37(3): pages 297-336, 1999), and (R. E. Schapire and Y. Singer, “Boostexter: A boosting-based system for text categorization”, Machine Learning, 39 (2/3): pages 135-168, 2000). In the following, Boosting, unless otherwise specified, refers to AdaBoost.
In Boosting, a plurality of weak hypotheses (e.g., rules) are generated from training examples having different weights by using a given weak learner for creating a final hypothesis consisting of the generated weak hypotheses. Each weak hypothesis is repeatedly generated from the training examples while the weights of the examples are changed. Finally, a final hypothesis, which is a combination of the weak hypotheses, is generated. A small weight is assigned to an example which can be correctly classified by the already learned weak hypotheses, and a large weight is assigned to an example which cannot be correctly classified by the already learned weak hypotheses.
The weights of the training examples are updated so as to reduce the upper bound of the training error, which is the number of errors for the training examples. The upper bound of the training error is a value greater than or equal to the actual number of training errors, and is the sum of the weights of the examples in Boosting. The number of training errors itself is lowered by lowering the upper bound of the training error.
A Boosting algorithm that handles a rule learner as the weak learner is used in the present description. Further, in the following, this algorithm will be described as a Boosting algorithm. First, there will be described a simple Boosting algorithm with reference to
Then, a score (also referred to as a gain) of each of the features included in the training sample is calculated according to the weights wt,i of the examples, and the feature whose score is the maximum is extracted as a rule ft (at S103). Here, wt,i is the weight of the i-th example at round t. The scores are calculated by using, for example, equation (4) in Formula 6 as will be described below. Note that the number of features may be about 100,000, and the number of examples included in the training sample may also be about 100,000. Thus, calculating the scores may take considerable time, yet only one feature is selected in each round.
Further, a confidence value ct of the rule ft is calculated by using the weights wt,i of the examples, and then the rule ft and the confidence value ct are stored as the t-th rule and confidence value (at S105). The calculation of the confidence value ct is performed by using, for example, equation (2) in Formula 4 or equation (7) in Formula 9 as will be described below.
Thereafter, new weights wt+1,i (1≦i≦m) are calculated by using the weights wt,i of the examples, the rule ft, and the confidence value ct, and are registered to update the weights (S107). The calculation of the new weights wt+1,i is performed by using, for example, equation (6) in Formula 8 as will be described below.
Then, the value of the variable t is incremented by one (S109). When the value of the variable t is smaller than the iteration frequency N, the processing is returned to S103 (at S111: Yes). On the other hand, when the value of the variable t reaches the iteration frequency N (at S111: No), the processing is ended.
By using the combinations of the rules and the confidence values, which are obtained as a result of the above described processing, it is determined whether the label of a new input is −1 or +1.
As described above, only one combination of the rule and the confidence value can be generated in one iteration. Thus, there is a problem that when the number of features and the number of training examples are increased, the processing time increases enormously.
For this reason, a high-speed version of the Boosting algorithm was considered. This high-speed version of the Boosting algorithm illustrated as
Then, a score (also referred to as gain) of each of the features included in the training sample is calculated according to the weights wt,i of the examples, so that ν features are extracted as rules f′j (1≦j≦ν) in descending order of the scores (at S153). The calculation of the score is performed by using, for example, equation (4) in Formula 6 as will be described below. When the scores are calculated from the data illustrated as
Then, the confidence values c′j corresponding to the ν rules f′j are collectively calculated by using the weights wt,i of the examples (at S155). The calculation of the confidence values c′j is performed by using, for example, equation (2) in Formula 4 or equation (7) in Formula 9 as will be described below. At S155, the ν confidence values c′j are calculated by using the same weights wt,i. In the above description, illustrated as
Here, j is initialized to 1 (at S157). Then, new weights wt+1,i (1≦i≦m) are calculated by using the weights wt,i of the examples, the rule f′j, and the confidence value c′j, and are registered to update the weights (at S159). The calculation of the new weights wt+1,i is performed by using, for example, equation (6) in Formula 8 as will be described below. In the above described example, the weights are calculated for a rule a. As illustrated in
The value of variable t and the value of variable j are respectively incremented by one (at S163), and it is determined whether or not the value of j is equal to or less than the value of ν (at S165). When the value of j is equal to or less than the value of ν, the processing shifts to S159.
When j=2, and when S159 is performed, the weights are calculated for a rule b in the above described example, so that new weights are registered to update the weights used in the calculation illustrated as
Further, when j=3, and when S159 is performed, the weights are calculated for a rule c, so that new weights are registered to update the weights used in the calculation illustrated as
On the other hand, when j exceeds ν, it is determined whether or not t is smaller than the iteration frequency N (at S167). When t<N, the processing returns to S153. The scores are again calculated in S153 so that the values of the scores are obtained as illustrated in
On the other hand, when t reaches the iteration frequency N (at S167: No), the processing is ended.
By using the combinations of the rules and the confidence values obtained as a result of the above described processing, it is determined whether the label of a new input is −1 or +1.
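The high-speed flow of S153 to S167 can be sketched as follows. This is a minimal illustration, not the original implementation: the three-example sample is inferred from the worked computation given later in the description, and the smoothed confidence of equation (7) with ε=1/m and the unnormalized update of equation (6) are assumed. The names `class_weights`, `confidence`, `update`, and `batch_round` are hypothetical.

```python
import math

# Toy sample inferred from the worked score computation in the description
# (an assumption; the original figure is not reproduced here).
SAMPLE = [({"a", "b", "c"}, +1), ({"a", "b", "c", "d"}, -1), ({"a", "b", "d"}, +1)]
EPS = 1.0 / len(SAMPLE)  # smoothing value, assumed to be 1/m


def class_weights(rule, sample, weights):
    """Sum the weights of the examples matched by the rule, split by label."""
    w_pos = sum(w for (x, y), w in zip(sample, weights) if rule <= x and y == +1)
    w_neg = sum(w for (x, y), w in zip(sample, weights) if rule <= x and y == -1)
    return w_pos, w_neg


def confidence(rule, sample, weights, eps=EPS):
    w_pos, w_neg = class_weights(rule, sample, weights)
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))


def update(rule, c, sample, weights):
    """Unnormalized update: only examples matched by the rule change."""
    return [w * math.exp(-y * c) if rule <= x else w
            for (x, y), w in zip(sample, weights)]


def batch_round(rules, sample, weights):
    """One round in which all confidence values come from the same weights (S155)."""
    confs = [confidence(r, sample, weights) for r in rules]
    sums = []
    for r, c in zip(rules, confs):          # S159 to S165
        weights = update(r, c, sample, weights)
        sums.append(sum(weights))
    return confs, weights, sums


rules = [{"a"}, {"b"}, {"c"}]
confs, final_weights, weight_sums = batch_round(rules, SAMPLE, [1.0, 1.0, 1.0])
print(confs)
print(weight_sums)
```

On this toy data, rules a and b receive the same confidence value because both are computed from the initial weights, and the sum of the weights after applying rule b is larger than after applying rule a alone. Since the sum of the weights serves as the upper bound of the training error, this illustrates how the bound can grow when confidence values are not recomputed between updates.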
By performing the processing illustrated as
According to an aspect of the invention, a rule learning method for making a computer perform rule learning processing in machine learning includes firstly calculating an evaluation value of respective features in a training example data storage unit storing a plurality of combinations of a training example and a weight of the training example, each example which includes one or more features and a label showing that the example is either true or false, by using data of the training examples and the weights of the training examples, and storing the calculated evaluation value in correspondence with the feature in an evaluation value storage unit; selecting a given number of features in descending order of the evaluation values stored in the evaluation value storage unit; secondly calculating a confidence value for one of the given number of selected features, by using the data and the weights of the training examples in the training example data storage unit, and storing a combination of the confidence value and the one of the selected features in a rule data storage unit; updating the weights stored in the training example data storage unit, by using the data and weights of the training examples, and the confidence value corresponding to the one feature; firstly repeating the updating for the remaining features of the given number of features; and secondly repeating, for a given number of times, the firstly calculating, the selecting, the secondly calculating, the updating, and the firstly repeating.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
When the processing as illustrated in
An object of the present technique is to perform the rule learning in Boosting at a higher speed, and also to prevent the increase in the upper bound of the training error.
First, there will be described a problem handled by the Boosting algorithm. Here, it is assumed that χ is a set of examples, and that a set of labels handled is y={−1, +1}. Further, an object of learning is to derive a mapping F: χ→y from a training sample S={(x1, y1), . . . , (xm, ym)}.
Here, it is assumed that |x| represents the number of types of features included in an example x∈χ. It is assumed that the relation xi∈χ (1≦i≦m) means that xi is a feature-set configured by features of |xi| types. Further, a feature-set which is configured by k features is described as a k-feature-set. Further, the relation yi∈y means that yi is the class label of the i-th feature-set included in S.
It is assumed that FT={f1, f2, . . . , fM} represents M types of features to be handled by the Boosting algorithm. Each feature of each example xi is expressed as xi,j∈FT (1≦j≦|xi|). The present technique is also capable of handling binary vectors. However, in the example described below, it is assumed that each of the features is expressed by a character string.
Further, a case where a certain feature-set includes another feature-set is defined as follows.
Definition 1
In two feature-sets x and x′, when x′ includes all the features included in x, x is referred to as a subset of the feature-set x′, and is described as x⊂x′.
Further, in the present embodiment, rules are defined on the basis of the theory of real-valued predictions and abstaining (RVPA) which is explained in the document (R. E. Schapire and Y. Singer, “Boostexter: A boosting-based system for text categorization”, Machine Learning, 39(2/3): 135-168, 2000) as described in the section of the background art. In the RVPA, when a feature-set of an input matches a condition, a confidence value expressed by a real number is returned. When the feature-set of the input does not match with the condition, “0” is returned. A weak hypothesis for classifying a feature-set is defined as follows.
Definition 2
When it is assumed that a feature-set f is a rule, and that x is a feature-set of an input, and further when it is assumed that a real number c is a confidence value of the rule f, the application of the rule is defined as follows.
The rule learning based on Boosting is to acquire combinations between T types of rule feature-sets and confidence values of the rule feature-sets (<f1, c1>, . . . , <fT, cT>) by learning in T times of Boosting rounds with a weak learner, so as to construct F defined as follows.
Note that sign(x) expresses a function which takes a value of 1 when x is 0 or more, or otherwise takes a value of −1.
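The rule application of Definition 2 and the final hypothesis F can be sketched as follows, assuming that a rule returns its confidence value when its feature-set is a subset of the input and 0 otherwise, and that F is the sign of the sum of the rule outputs. The function names and the two rules are hypothetical and serve only as an illustration.

```python
def apply_rule(rule, confidence, x):
    """Return the confidence if every feature of the rule occurs in x, else 0."""
    return confidence if rule <= x else 0.0


def classify(rules, x):
    """Final hypothesis: sign of the summed rule outputs, with sign(0) = +1."""
    total = sum(apply_rule(f, c, x) for f, c in rules)
    return 1 if total >= 0 else -1


# Hypothetical learned rules: (feature-set, confidence value).
rules = [(frozenset({"a"}), 0.2), (frozenset({"c"}), -0.5)]

print(classify(rules, {"a", "b"}))   # only rule a fires
print(classify(rules, {"a", "c"}))   # both rules fire
print(classify(rules, {"d"}))        # no rule fires, so the sum is 0
```

Because a rule abstains (returns 0) on inputs it does not match, an input matched by no rule is classified as +1 under the stated sign convention.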
The weak learner derives a rule ft and a confidence value ct of the rule ft by using the training sample S={(xi, yi)} (1≦i≦m) and weights {wt,1, . . . , wt,m} of the respective training examples at the time of the t-th Boosting round. The expression wt,i (0<wt,i) denotes a weight of the i-th example (xi, yi) (1≦i≦m) in the t-th Boosting round (1≦t≦T).
Based on the given training sample and the weights of the training examples, the weak learner selects, as a rule, a feature-set f which minimizes the following formula, and the confidence value c of the feature-set f.
Note that [[π]] is 1 if a proposition π holds and 0 otherwise.
Equation (1) in Formula 3 is used as a reference for selecting the rule because the upper bound of the training error of the learning algorithm based on Boosting relates to the sum of the weights of the examples.
When equation (1) in Formula 3 is minimized by a certain rule f, the confidence value c at this time is expressed as follows.
The following formula is obtained by substituting equation (2) in Formula 4 into equation (1) in Formula 3.
It may be seen from equation (3) in Formula 5 that minimizing equation (1) in Formula 3 is equivalent to selecting a feature-set f which maximizes a score defined by the following formula.
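The equivalence between minimizing equation (1) and maximizing the score can be checked numerically. The sketch below assumes that equation (1) is the weighted exponential loss, that equation (2) is c = (1/2)·ln(W+1(f)/W−1(f)), and that the score is |sqrt(W+1(f)) − sqrt(W−1(f))|. The weighted sample and all function names are hypothetical.

```python
import math

# Hypothetical weighted sample: (feature-set, label, weight); weights sum to 1.
SAMPLE = [
    ({"a", "b"}, +1, 0.30),
    ({"a"}, -1, 0.10),
    ({"b", "c"}, -1, 0.20),
    ({"a", "c"}, +1, 0.25),
    ({"c"}, -1, 0.15),
]


def class_weights(rule):
    """Weights of the examples matched by the rule, split by label."""
    w_pos = sum(w for x, y, w in SAMPLE if rule <= x and y == +1)
    w_neg = sum(w for x, y, w in SAMPLE if rule <= x and y == -1)
    return w_pos, w_neg


def confidence(rule):
    w_pos, w_neg = class_weights(rule)
    return 0.5 * math.log(w_pos / w_neg)


def score(rule):
    w_pos, w_neg = class_weights(rule)
    return abs(math.sqrt(w_pos) - math.sqrt(w_neg))


def objective(rule, c):
    """Sum of the weights after applying the rule with confidence c."""
    return sum(w * math.exp(-y * c) if rule <= x else w for x, y, w in SAMPLE)


candidates = [{"a"}, {"b"}, {"c"}]
total = sum(w for _, _, w in SAMPLE)
gaps = []
for rule in candidates:
    # With the optimal confidence, the objective equals total - score^2.
    gaps.append(abs(objective(rule, confidence(rule)) - (total - score(rule) ** 2)))
    print(sorted(rule), round(score(rule), 4))

best_by_score = max(candidates, key=score)
best_by_objective = min(candidates, key=lambda r: objective(r, confidence(r)))
print(best_by_score, best_by_objective)
```

For each candidate the minimized objective equals the total weight minus the squared score, so the rule with the largest score is also the rule with the smallest objective.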
The weights of the respective examples are updated by (ft, ct). Note that there are cases where the weights are normalized so that the total sum of the weights becomes 1, and cases where the weights are not normalized. When normalization is performed, the weight wt+1,i in the (t+1)th round is defined as follows.
When normalization is not performed, the weight wt+1,i in the (t+1)th round is defined as follows.
[Formula 8]
$$w_{t+1,i} = w_{t,i} \exp\left(-y_i h_{\langle f_t, c_t \rangle}(x_i)\right) \qquad (6)$$
Note that it is assumed that when the normalization is performed, the initial values w1,i of the weights are set to 1/m (where m is the number of training examples), and that when the normalization is not performed, the initial values w1,i of the weights are set to 1.
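Both update variants can be sketched in one function. This is an illustrative sketch only: the unnormalized case multiplies each weight by exp(−yi·h(xi)), and the normalized case additionally divides by the sum of the updated weights, which is assumed here to be the normalization factor Zt. The sample, the rule, the confidence value, and the function name are hypothetical.

```python
import math


def update_weights(weights, sample, rule, c, normalize=False):
    """Apply one rule to the weights; the rule abstains on non-matching examples."""
    raw = [w * math.exp(-y * c) if rule <= x else w
           for (x, y), w in zip(sample, weights)]
    if not normalize:
        return raw
    z = sum(raw)                      # normalization factor Z_t
    return [w / z for w in raw]


# Hypothetical sample and rule, for illustration only.
sample = [({"a", "b", "c"}, +1), ({"a", "b", "c", "d"}, -1), ({"a", "b", "d"}, +1)]
rule, c = {"c"}, -0.25
m = len(sample)

unnormalized = update_weights([1.0] * m, sample, rule, c)
normalized = update_weights([1.0 / m] * m, sample, rule, c, normalize=True)
print(unnormalized)
print(sum(normalized))
```

With a negative confidence value, the weight of the matched positive example grows, the weight of the matched negative example shrinks, and the weight of the unmatched example is unchanged.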
Further, when the features seldom appear (that is, when features appear in only a few examples), there may arise a case where the value of Wt,+1(f) or Wt,−1(f) becomes a very small value or 0. In order to avoid this, a value ε for smoothing is introduced.
That is, equation (2) in Formula 4 is transformed as follows.
For example, ε=1/m or ε=1 is used.
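The effect of the smoothing value can be illustrated with a feature that appears only in positive examples. The sketch assumes that the smoothed confidence has the form c = (1/2)·ln((W+1(f)+ε)/(W−1(f)+ε)); the weights and the function name are hypothetical.

```python
import math


def smoothed_confidence(w_pos, w_neg, eps):
    """Confidence value with the smoothing value eps added to both weight sums."""
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))


w_pos, w_neg, m = 2.0, 0.0, 3   # hypothetical: feature seen only in positive examples

try:
    unsmoothed = 0.5 * math.log(w_pos / w_neg)
except ZeroDivisionError:
    unsmoothed = None             # the unsmoothed form is undefined here

c_small = smoothed_confidence(w_pos, w_neg, eps=1.0 / m)
c_large = smoothed_confidence(w_pos, w_neg, eps=1.0)
print(unsmoothed, c_small, c_large)
```

A larger ε pulls the confidence value closer to 0, so ε=1 gives a more conservative rule than ε=1/m.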
An embodiment of the present technique is described on the basis of the premise as described above.
First, for example, according to an instruction from a user, the training sample input section 1 receives inputs of: a training sample S={(x1, y1), (x2, y2), . . . , (xm, ym)} including m examples, each of which is a combination of a feature-set xi that includes one or more features with a label yi that is either −1 or +1; initial values w1,i=1 (1≦i≦m) of m weights corresponding to the m examples; an iteration frequency N; the number ν of rules learned at one time; and a variable t=1 for counting the iteration frequency. The training sample input section 1 stores the received inputs in the training sample storage unit 3 (at S1). In order to facilitate understanding, an example is described in which a training sample as illustrated in
The rule learning section 5 performs rule extracting processing by using the data stored in the training sample storage unit 3 (at S3). Next, the rule extracting processing is described with reference to
First, the rule learning section 5 extracts an unprocessed feature included in the training sample S as a rule candidate (at S21). For example, the rule learning section 5 extracts a feature a in the example illustrated in
In order to calculate the score of the feature a in the example illustrated in
W1,+1(a)=1×[[a⊂(a b c)∧(+1)=(+1)]]+1×[[a⊂(a b c d)∧(−1)=(+1)]]+1×[[a⊂(a b d)∧(+1)=(+1)]]=1+0+1=2, and
W1,−1(a)=1×[[a⊂(a b c)∧(+1)=(−1)]]+1×[[a⊂(a b c d)∧(−1)=(−1)]]+1×[[a⊂(a b d)∧(+1)=(−1)]]=0+1+0=1
Therefore, the score of the feature a is calculated as |sqrt(2)−sqrt(1)|=0.414, where sqrt(x) denotes the square root of x.
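The computation above can be repeated for every feature. The sketch below assumes the three-example sample that appears in the terms of W1,+1(a) and W1,−1(a), with all initial weights equal to 1; the names `score` and `score_table` are hypothetical.

```python
import math

# Sample read off the worked computation: (feature-set, label).
SAMPLE = [({"a", "b", "c"}, +1), ({"a", "b", "c", "d"}, -1), ({"a", "b", "d"}, +1)]
weights = [1.0, 1.0, 1.0]


def score(feature, sample, weights):
    """|sqrt(W+) - sqrt(W-)| over the examples that contain the feature."""
    w_pos = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == +1)
    w_neg = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == -1)
    return abs(math.sqrt(w_pos) - math.sqrt(w_neg))


score_table = {f: score(f, SAMPLE, weights) for f in "abcd"}
for feature, value in score_table.items():
    print(feature, round(value, 3))
```

Under these assumptions, features a and b obtain the same score because they occur in exactly the same examples, while c and d each occur in one positive and one negative example and score 0.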
Then, the rule learning section 5 determines whether or not all the features are processed (at S25). When an unprocessed feature exists, the rule learning section 5 returns to S21. In the example described above, when such processing is repeated, the scores of the features a, b, c, and d are calculated, and a score table as illustrated in
On the other hand, when no unprocessed feature exists, the rule learning section 5 sorts the records of the score table in the descending order of the scores, and selects ν features (rule candidates) having higher scores as rules f′j (1≦j≦ν) (S27). Then, the rule learning section 5 returns to the original processing. For example, in the case of the score table as illustrated in
Returning to the explanation of the processing in
In the present embodiment, unlike the prior art, the rule learning section 5 calculates one confidence value c′j for one rule f′j by using the present weights wt,i . . . .
Then, the rule learning section 5 calculates new weights wt+1,i by using the weights wt,i, the rule f′j, and the confidence value c′j, and registers the weights wt+1,i in the training sample storage unit 3 to update the weights wt,i (at S9).
For example, as illustrated in
Then, the rule learning section 5 registers the rule f′j and the confidence value c′j in the rule data storage unit 7 as the t-th rule and confidence value (at S11).
Thereafter, the rule learning section 5 increments both t and j by one, (at S13). Then, the rule learning section 5 determines whether or not j is equal to or less than ν (at S15). When j is equal to or less than ν, the rule learning section 5 returns to S7, and performs the processing for the next rule f′j.
In this way, in the present embodiment, each time the rule learning section 5 calculates the confidence value c′j corresponding to the rule f′j, the rule learning section 5 updates the weights wt,i, and thereby prevents the upper bound of the training error from being increased.
In the example described above, the rule learning section 5 returns to S7 and performs the processing of the feature b, so that the confidence value c of the feature b is calculated to be 0.054. When the rule learning section 5 calculates the weights wt+1,i to be used in the next calculation, by using the confidence value c and the weights wt,i illustrated in
Further, in the above described example, the rule learning section 5 returns to S7 and performs the processing of the feature c, so that the confidence value c of the feature c is calculated to be −0.249. When the rule learning section 5 calculates the weights wt+1,i to be used in the next calculation, by using the confidence value c and the weights wt,i as illustrated in
On the other hand, when j exceeds ν, the rule learning section 5 determines whether or not t is smaller than N (at S17). When t is smaller than N, the rule learning section 5 returns to the process of S3. On the other hand, when t is equal to or more than N, the rule learning section 5 ends.
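The flow of S3 to S17 can be sketched as follows. This is a minimal sketch under stated assumptions: the sample is the three-example sample inferred from the worked score computation, the weights are not normalized, and the confidence value uses smoothing with ε=1/m. The names `class_weights`, `score`, and `learn` are hypothetical.

```python
import math

# Toy sample inferred from the worked score computation (an assumption).
SAMPLE = [({"a", "b", "c"}, +1), ({"a", "b", "c", "d"}, -1), ({"a", "b", "d"}, +1)]


def class_weights(feature, sample, weights):
    w_pos = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == +1)
    w_neg = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == -1)
    return w_pos, w_neg


def score(feature, sample, weights):
    w_pos, w_neg = class_weights(feature, sample, weights)
    return abs(math.sqrt(w_pos) - math.sqrt(w_neg))


def learn(sample, n_iterations, nu, eps):
    weights = [1.0] * len(sample)
    features = sorted(set().union(*(x for x, _ in sample)))
    learned, sums = [], [sum(weights)]
    t = 1
    while True:
        # S3: rank features by score; the stable sort keeps ties in name order.
        ranked = sorted(features, key=lambda f: score(f, sample, weights), reverse=True)
        for f in ranked[:nu]:
            # S7: confidence from the current weights, not the start-of-round ones.
            w_pos, w_neg = class_weights(f, sample, weights)
            c = 0.5 * math.log((w_pos + eps) / (w_neg + eps))
            # S9: update the weights immediately after each rule.
            weights = [w * math.exp(-y * c) if f in x else w
                       for (x, y), w in zip(sample, weights)]
            learned.append((f, c))          # S11
            sums.append(sum(weights))
            t += 1                          # S13
        if not t < n_iterations:            # S17
            return learned, sums


learned, weight_sums = learn(SAMPLE, n_iterations=3, nu=3, eps=1.0 / len(SAMPLE))
for f, c in learned:
    print(f, round(c, 4))
print([round(s, 4) for s in weight_sums])
```

With these settings the sketch selects a, b, and c and yields confidence values of about 0.054 for b and about −0.25 for c, in line with the values quoted in the description, and the sum of the weights does not increase from one rule to the next.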
In the above described example, when the rule learning section 5 returns to S3 and again calculates the scores of the features a, b, c, and d, the results as illustrated in
When the weights wt,i illustrated in
Next, when the confidence value c of the feature b is calculated by using the weights wt,i as illustrated in
Further, when the confidence value c of the feature d is calculated by using the weights wt,i as illustrated in
Convergence of AdaBoost in BoosTexter is described below.
First, there will be described the upper bound of the training error of AdaBoost which is proved in Theorem 1 of the document (R. E. Schapire and Y. Singer, “Improved boosting using confidence-rated predictions”, Machine Learning, 37(3): 297-336, 1999), the document which is described in the background art. The AdaBoost described in this document normalizes the weights, and is hereinafter referred to as AdaBoost-normalized. As explained in the document, the upper bound of the training error is the product, over the rounds, of the sums of the example weights calculated in each round before normalization. Subsequently, it is shown that even when any rule is added, the upper bound of the training error becomes smaller than or equal to the upper bound of the training error of the previous round.
First, the upper bound of the training error of F, which is configured by T rules derived on the basis of AdaBoost as proposed in the above described document, is expressed as follows.
This upper bound of the training error is derived as follows.
When the initial weights are set as w1,i=1/m, equation (8) in Formula 11 is obtained by expanding equation (5) in Formula 7, which expresses the weight update rule.
Further, when F(xi)≠yi, the following formula is obtained.
Thus, the following formula is obtained.
As a result, the following formula is obtained.
Therefore, from equation (8) in Formula 11 and equation (9) in Formula 14, the above described upper bound of the training error is obtained as follows.
Subsequently, when a new rule is added to (T−1) rules derived by AdaBoost-normalized, even when any rule is added, the upper bounds of the training error, which are obtained respectively from the (T−1) rules and the T rules, have the following relation.
First, the upper bound of the training error obtained from the T rules is rewritten as follows.
Here, when ZT is rewritten on the basis of definition 2, the following formula is obtained.
Here, WT−1(f) is expressed as follows.
[Formula 19]
$$W_T(f) = \sum_{i=1}^{m} w_{T,i} - W_{T,+1}(f) - W_{T,-1}(f)$$
Thus, the following formula is obtained.
By this formula, ZT is eventually rewritten as follows.
Note that in AdaBoost-normalized, the total sum of the (t−1)th weights is 1, and hence the following formula is obtained.
[Formula 22]
$$Z_T = 1 - \left(\sqrt{W_{T,+1}(f_T)} - \sqrt{W_{T,-1}(f_T)}\right)^2 \le 1 \qquad (11)$$
From equation (10) in Formula 17 and equation (11) in Formula 22, as described above, the following relation is obtained.
Therefore, even when any rule is added, the upper bound of the training error becomes smaller than or equal to the upper bound of the training error of the previous round.
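The two properties derived above can be checked numerically for AdaBoost-normalized. The sketch assumes the three-example sample inferred from the worked score computation, initial weights 1/m, one rule per round chosen by the score, and the unsmoothed confidence c = (1/2)·ln(W+1(f)/W−1(f)). All function names are hypothetical.

```python
import math

SAMPLE = [({"a", "b", "c"}, +1), ({"a", "b", "c", "d"}, -1), ({"a", "b", "d"}, +1)]
FEATURES = ["a", "b", "c", "d"]


def class_weights(feature, weights):
    w_pos = sum(w for (x, y), w in zip(SAMPLE, weights) if feature in x and y == +1)
    w_neg = sum(w for (x, y), w in zip(SAMPLE, weights) if feature in x and y == -1)
    return w_pos, w_neg


def score(feature, weights):
    w_pos, w_neg = class_weights(feature, weights)
    return abs(math.sqrt(w_pos) - math.sqrt(w_neg))


def run(rounds):
    m = len(SAMPLE)
    weights = [1.0 / m] * m
    rules, zs, gaps = [], [], []
    for _ in range(rounds):
        f = max(FEATURES, key=lambda g: score(g, weights))
        w_pos, w_neg = class_weights(f, weights)
        c = 0.5 * math.log(w_pos / w_neg)
        raw = [w * math.exp(-y * c) if f in x else w
               for (x, y), w in zip(SAMPLE, weights)]
        z = sum(raw)                                   # Z_t of this round
        gaps.append(abs(z - (1.0 - score(f, weights) ** 2)))
        weights = [w / z for w in raw]                 # normalize
        rules.append((f, c))
        zs.append(z)
    return rules, zs, gaps


def training_error(rules):
    wrong = 0
    for x, y in SAMPLE:
        total = sum(c for f, c in rules if f in x)
        wrong += (1 if total >= 0 else -1) != y
    return wrong / len(SAMPLE)


rules, zs, gaps = run(3)
bound = math.prod(zs)
error_rate = training_error(rules)
print(zs, bound, error_rate)
```

In this run every Zt matches 1 minus the squared score and stays at or below 1, and the training error rate stays at or below the product of the Zt values.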
There is another kind of AdaBoost, which is AdaBoost in the AdaTrees learning algorithm. It is described in the document (Y. Freund and L. Mason, “The alternating decision tree learning algorithm”, In Proc. of 16th ICML, pages 124-133, 1999) which is introduced in the background art.
AdaBoost in the AdaTrees learning algorithm does not normalize the weights. Therefore, the AdaBoost in the AdaTrees learning algorithm is hereinafter referred to as AdaBoost-unnormalized. The upper bound of the training error of AdaBoost-unnormalized becomes the sum of the weights of the examples, the weights of which are updated in each round. This is derived on the basis of Theorem 1 of the document (R. E. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000). Further, in AdaBoost used in the AdaTrees learning algorithm, even when any rule is added, the upper bound of the training error becomes smaller than, or at worst equal to, the upper bound of the training error of the previous round.
First, F, which is configured by T rules derived by AdaBoost used in the AdaTrees learning algorithm, has an upper bound of the training error that is expressed as follows.
Here, the following formula is used.
Equation (12) in Formula 26 is obtained by developing equation (6) in Formula 8 which is the weight update rule.
Further, when F(xi)≠yi, the following relation is established.
Thus, the following formula is obtained.
As a result, the following formula is obtained.
Therefore, the upper bound of the training error is obtained from equation (12) in Formula 26 and equation (13) in Formula 29 as follows.
Subsequently, when a new rule is added to the (T−1) rules derived by AdaBoost used in the AdaTrees learning algorithm, whichever rule is added, the upper bound of the training error obtained from the (T−1) rules and the upper bound of the training error obtained from the T rules have a relationship of Z′T≦Z′T−1.
From definition 2, the upper bound of the training error obtained from T rules is rewritten as follows.
Here, the following formula is used.
Then, the following formula is obtained.
From this formula, the following relationship is obtained for the upper bound of the training error, which is obtained from the T rules.
Therefore, even when any rule is added, the upper bound of the training error becomes smaller than or equal to the upper bound of the training error of the previous round.
By performing the above described processing, the combinations of the rules and the confidence values are registered in the rule data storage unit 7, and substantially the same classification processing as the conventional classification processing is performed by the rule application section 13.
Here, there is discussed the effect when, for the training sample in
Further, by using a training sample in which 1340 words are proper nouns among 12821 words, and which is prepared for the proper noun discrimination, the degree of accuracy of discrimination of the proper nouns for the training sample is measured for each learning time, and the measurement results are depicted in
F=2*Recall*Precision/(Recall+Precision)
Recall=the number of correctly distinguished proper nouns/the number of proper nouns
Precision=the number of correctly distinguished proper nouns/the number of proper nouns responded to
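The three definitions above translate directly into code. In the sketch below the function name and the counts of predicted and correctly distinguished proper nouns are hypothetical; only the total of 1340 proper nouns is taken from the description.

```python
def f_measure(n_correct, n_actual, n_predicted):
    """F-measure from the counts used in the Recall and Precision definitions."""
    recall = n_correct / n_actual
    precision = n_correct / n_predicted
    return 2 * recall * precision / (recall + precision)


n_actual = 1340       # proper nouns in the sample (from the description)
n_predicted = 1200    # hypothetical: words output as proper nouns
n_correct = 1100      # hypothetical: correctly distinguished proper nouns

f_value = f_measure(n_correct, n_actual, n_predicted)
print(round(f_value, 4))
```

The F-measure is the harmonic mean of Recall and Precision, so it equals 2 times the number of correct outputs divided by the sum of the number of proper nouns and the number of outputs. The sketch does not guard against zero counts.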
It can be clearly seen that the accuracy may be more rapidly improved by adopting the processing flow of
As described above, an embodiment according to the present technique has been described, but the present technique is not limited to this. For example, the functional block diagram in
Further, the present technique may also be applied to Boosting algorithms which handle other weak learners. For example, as an example of another weak learner, there exists an algorithm referred to as C4.5 (see, for example, C4.5: Programs for Machine Learning, Morgan Kaufmann Series in Machine Learning, J. Ross Quinlan, Morgan Kaufmann, 1993). C4.5 learns a weak hypothesis (that is, a rule) in the form of a decision tree. The present technique may be applied to C4.5 by learning more than one decision tree at each iteration.
Further, even in the case of a Boosting algorithm which handles a weak learner for classifying trees and graphs, the present technique may be similarly applied by learning more than one rule at each iteration.
For example, see the following documents.
Document: Kudo Taku, Matsumoto Yuji, “A Boosting Algorithm for Classification of Semi-Structured Text”, Proceedings of EMNLP 2004, pages 301-308, 2004.
Document: Taku Kudo, Eisaku Maeda, Yuji Matsumoto, “An Application of Boosting to Graph Classification”, Proceedings of NIPS 2004, pages 729-736, 2005.
An embodiment of the present technique is summarized as follows.
An evaluation value g may also be calculated with the following formula from a feature f, feature-sets xi, labels yi, weights wi of training examples, and the number m of the training examples.
(where [[π]] is 1 if a proposition π holds and 0 otherwise).
Further, the confidence value c may also be calculated with the following formula from a feature f, feature-sets xi, labels yi, weights wi of training examples, the number m of the training examples, and a certain smoothing value ε.
(where [[π]] is 1 if a proposition π holds and 0 otherwise).
Further, the weights wt+1,i of the training examples, the weights of which are used for the (t+1)th processing, may also be calculated by the following formulas from the feature ft and the confidence value ct in the t-th processing, the feature-sets xi, the labels yi, the weights wt,i of the training examples in the t-th processing, and the number m of the training examples.
Alternatively, the weights wt+1,i of the training examples may also be calculated by dividing the above-obtained wt+1,i by Zt expressed as follows.
A program may be created for making a computer perform the present technique. The program may be stored in a computer readable storage medium such as, for example, a flexible disk, CD-ROM, a magneto-optical disk, a semiconductor memory, and a hard disk, or in a storage apparatus. Further, the program may also be distributed as digital signals via a network, and the like. Note that intermediate processing results may be temporarily stored in a storage apparatus, such as a main memory.
Note that a business system analysis apparatus is a computer apparatus, and is, as illustrated in
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiment of the present invention has been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2008-193067 | Jul 2008 | JP | national |