This application claims the benefit of priority under 35 U.S.C. §119 to Japanese Patent Application No. 2005-152324 filed on May 25, 2005, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to a data division apparatus, data division method and program used to conduct data division (clustering) on a point set in an n-dimensional space.
2. Related Art
In recent years, plant systems have in some cases been constructed so that an abnormality in the plant is found by monitoring the proper ranges of sensors attached to the individual apparatuses (objects to be measured) included in the plant system. A proper range that the sensor value should assume is set in advance, and an abnormality alarm is issued when the sensor value falls outside the proper range. As the number of sensors increases, automation of the proper range setting is desired. For setting a proper range for a certain sensor (hereafter referred to as target sensor), at least one other sensor (hereafter referred to as explanatory sensor) can be used. A model for predicting the target sensor on the basis of the explanatory sensor is constructed. If the predicted value differs largely from the actual value, the possibility that the target sensor indicates an abnormal value is high.
The prediction model can be created by using time series data (multi-dimensional data) of the target sensor and the explanatory sensor collected in the past. In general, however, construction of this prediction model is not easy. This is because a value assumed by the target sensor is not determined uniquely by the value of the explanatory sensor, but depends on the running situation of the plant as well. This situation will now be described by using an example of a sensor in a power plant.
It is now supposed that there is plot data (running history data) with its ordinate indicating a pressure of a pump output from the target sensor and its abscissa indicating a generated power output which is output from the explanatory sensor. The pump has an operation state and a non-operation state. It is supposed that in the operation state of the pump the pressure of the pump is in proportion to the generated power output and in the non-operation state of the pump the pressure of the pump assumes a low constant value. If a model for predicting the value of the target sensor on the basis of the explanatory sensor is generated by using, for example, regression analysis, without separating the above two operation situations, the error of the model becomes great. It is desirable to generate models respectively based on the operation situations of the pump. For doing so, it is necessary to separate a set of points in the running history data into a plurality of groups and generate models respectively for groups.
As techniques for grouping points on a plane or in a space, there are the k-means method and the agglomerative method. These techniques are described in Michael J. A. Berry and Gordon Linoff, “Data Mining Techniques”, Wiley Computer Publishing, pp. 187-215.
In the k-means method, k initial points are selected previously, and each point of remaining points is regarded as belonging to the same group as a point among the k points closest to the point. A centroid is calculated for every group, and grouping is repeated again regarding centroids as k initial points. On the other hand, in the agglomerative method, a combination having the shortest distance among all combinations of points is regarded as one group. A centroid of grouped points is regarded as one point, and similar processing is repeated until all points belong to one group. Incidentally, as for other ways of measuring a distance, there are a method using a distance between closest points in groups and a method using a distance between farthest points.
These techniques basically group points that are close to each other, and only the distance between points is considered. Therefore, they cannot conduct grouping that properly reflects the above-described state of the measurement subject, i.e., grouping that reflects a tendency other than the distance between points which is immanent in the multi-dimensional data, for example, grouping close to human intuition.
According to an aspect of the present invention, there is provided a data division apparatus which divides multi-dimensional data including a plurality of data pieces, comprising: a data input unit which inputs the multi-dimensional data; a division plane candidate creator which creates a plurality of division plane candidates for dividing the multi-dimensional data; a data provisional division unit which provisionally divides the multi-dimensional data by using the division plane candidates to generate clusters from each of the division plane candidates, the clusters each including one or more data pieces; a model generator which generates models representing the clusters, for each of the division plane candidates; an evaluation value calculator which calculates evaluation values for evaluating the division plane candidates, on the basis of the models generated in association with the division plane candidates and the multi-dimensional data; a division candidate selector which compares evaluation values respectively corresponding to the division plane candidates and selects a division plane candidate having a highest evaluation value; and a data division unit which divides the multi-dimensional data by using the selected division plane candidate.
According to an aspect of the present invention, there is provided a data division apparatus which divides multi-dimensional data including a plurality of data pieces, comprising: a data input unit which inputs the multi-dimensional data; a division plane candidate creator which creates a plurality of division plane candidates for dividing the multi-dimensional data; a data provisional division unit which provisionally divides the multi-dimensional data by using the division plane candidates to generate clusters from each of the division plane candidates, the clusters each including one or more data pieces; a model generator which generates models representing the clusters, for each of the division plane candidates; a grouping unit which generates new clusters by grouping data pieces in the multi-dimensional data, based on which of the generated models each data piece in the multi-dimensional data is close to, for each of the division plane candidates; an evaluation value calculator which calculates evaluation values for evaluating the groupings associated with the division plane candidates, based on the models generated in association with the division plane candidates and the new clusters generated in association with the division plane candidates; and a division candidate selector which compares evaluation values respectively corresponding to the division plane candidates and selects a grouping result corresponding to a division plane candidate having a highest evaluation value.
According to an aspect of the present invention, there is provided a data division method which divides multi-dimensional data including a plurality of data pieces, comprising: inputting the multi-dimensional data; creating a plurality of division plane candidates for dividing the multi-dimensional data; provisionally dividing the multi-dimensional data by using the division plane candidates to generate clusters from each of the division plane candidates, the clusters each including one or more data pieces; generating models representing the clusters, for each of the division plane candidates; calculating evaluation values for evaluating the division plane candidates, on the basis of the models generated in association with the division plane candidates and the multi-dimensional data; comparing evaluation values respectively corresponding to the division plane candidates and selecting a division plane candidate having a highest evaluation value; dividing the multi-dimensional data by using the selected division plane candidate; and performing the creating, the provisional dividing, the generating, the calculating, the comparing and the dividing for the divided multi-dimensional data.
According to an aspect of the present invention, there is provided a program for causing a computer to execute: reading out multi-dimensional data including a plurality of data pieces, from a storage device; creating a plurality of division plane candidates for dividing the multi-dimensional data; provisionally dividing the multi-dimensional data by using the division plane candidates to generate clusters from each of the division plane candidates, the clusters each including one or more data pieces; generating models representing the clusters, for each of the division plane candidates; calculating evaluation values for evaluating the division plane candidates, on the basis of the models generated in association with the division plane candidates and the multi-dimensional data; comparing evaluation values respectively corresponding to the division plane candidates and selecting a division plane candidate having a highest evaluation value; dividing the multi-dimensional data by using the selected division plane candidate; and performing the creating, the provisional dividing, the generating, the calculating, the comparing and the dividing for the divided multi-dimensional data.
First, an outline of an embodiment of the present invention will be described briefly.
Measurement subjects 21, 22, 23 and 24 are disposed in facilities in a plant. Sensors x, y, z and w are installed respectively in the measurement subjects 21, 22, 23 and 24. Data 11, 12, 13 and 14 acquired respectively from the sensors x, y, z and w in time series are stored as four-dimensional data having a series length n (multi-dimensional data) (see
The present embodiment provides a technique capable of reflecting the state of the measurement subject of the target sensor and conducting data division (clustering) on multi-dimensional data, i.e., data division which properly reflects the tendency other than the distance between points, immanent in multi-dimensional data. By this data division, multi-dimensional data are properly divided into a plurality of clusters. In the present embodiment, models respectively corresponding to clusters are also generated.
The models 15b and 16b thus generated are used to, for example, determine in real time whether the value of the target sensor y is in a proper range. For example, it is determined on the basis of a previously generated classification rule which of the clusters 15a and 16a data 17 of the target sensor acquired at a certain point of time belongs to. It is now assumed that the data 17 belongs to the cluster 15a. In this case, data 17 is input to the model 15b and a model output is found. A difference 18 between the model output and the data 17 is calculated. If the difference 18 is in a predetermined range, the measurement subject is judged to be in the normal state. Otherwise, the measurement subject is judged to be in the abnormal state.
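The normal/abnormal decision described above can be illustrated by the following minimal sketch. It assumes that each cluster is represented by a linear model y = a*x + b and that the classification rule is a simple function of the explanatory value x; the names `is_normal`, `classify`, `models` and `tolerance`, and all numerical values, are hypothetical and are not taken from the embodiment.

```python
def is_normal(x, y, classify, models, tolerance):
    """Return True if the target value y lies within tolerance of the model output."""
    cluster = classify(x)          # the classification rule picks the cluster
    a, b = models[cluster]         # assumed linear model y = a*x + b of that cluster
    difference = y - (a * x + b)   # plays the role of the difference 18
    return abs(difference) <= tolerance


def classify(x):
    """Hypothetical classification rule: pump stopped below x = 4, running otherwise."""
    return "stopped" if x < 4 else "running"


# Hypothetical models: constant low pressure when stopped, proportional when running.
models = {"stopped": (0.0, 1.0), "running": (2.0, 0.0)}

print(is_normal(5.0, 10.3, classify, models, tolerance=0.5))  # True
print(is_normal(5.0, 1.0, classify, models, tolerance=0.5))   # False
```

In this sketch the predetermined range is symmetric around the model output; an asymmetric range could be handled in the same way with two bounds.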
Hereafter, embodiments of the present invention will be described in detail.
This data division system includes a CPU 31, a memory 32, a hard disk 33, and a display apparatus 34. A program is stored on the hard disk 33 to implement the present embodiment. Data acquired from a plurality of sensors in time series are stored on the hard disk 33 as multi-dimensional data. The CPU 31 loads the program stored on the hard disk 33 into the memory 32 and executes the program. The display apparatus 34 displays a result of execution in the CPU 31 to the user.
The data division apparatus shown in
The data input unit 41 inputs multi-dimensional data to the data discretization unit 42. The multi-dimensional data includes a plurality of data pieces. An example of the multi-dimensional data is shown in
The data discretization unit 42 discretizes the input multi-dimensional data (step 1). Hereafter, details thereof will be described.
Elements in each of x and y dimensions are discretized to integers in the range of 0 to m−1 by using a minimum value and a maximum value. Here, m is an arbitrary integer given by the user. For example, it is now supposed that the x dimension has a minimum value xmin and a maximum value xmax and ith data (i.e. ith data piece) has a value xi in the x dimension. A value xdi of the ith data in the x dimension after discretization is the index, counted from the head, of the section into which xi falls among the sections obtained by equally dividing the range between xmin and xmax into m. The processing heretofore described is conducted for the y dimension as well. Owing to the processing heretofore described, the elements xi and yi of the data i in respective dimensions are respectively discretized to xdi and ydi (where 0≦xdi≦m−1, 0≦ydi≦m−1, and xdi and ydi are integers).
In processing described hereafter and processing in other embodiments, either of data before the discretization and data after the discretization may be used as the multi-dimensional data unless otherwise mentioned. In the former case, the processing time becomes longer, but the precision becomes higher. Conversely, in the latter case, the processing becomes fast, but the precision becomes lower. In other words, the discretization processing is conducted to reduce the amount of calculation in processing at step 2 and subsequent steps, and is not indispensable for the present invention.
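The discretization of one dimension can be sketched as follows. This is a minimal illustration, not the embodiment itself: the function name `discretize` is hypothetical, and assigning the maximum value to the last section (index m−1) is an assumption made so that every index stays within 0 to m−1.

```python
def discretize(values, m):
    """Map each value to a section index in 0..m-1 over the range [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # degenerate range: everything in section 0
    width = (hi - lo) / m           # m equal sections between min and max
    # Assumption: the maximum itself would land in section m, so clamp to m - 1.
    return [min(int((v - lo) / width), m - 1) for v in values]


xs = [0.0, 1.0, 2.5, 5.0, 9.9, 10.0]
print(discretize(xs, 5))  # [0, 0, 1, 2, 4, 4]
```

The same function would be applied to the y dimension to obtain ydi.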
The division plane candidate creator 43 finds a set of planes (a set of lines in the case of two-dimensional data) perpendicular to each axis, as candidates for a division plane which bisects multi-dimensional data (point set) (step 2).
Here, a boundary line between two adjacent sections in the matrix generated by the data discretization unit 42 is used as a division plane. Here, intervals between adjacent division planes become constant. However, the intervals need not always be constant. There are m−1 division planes for each dimension.
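As an illustration, the equally spaced division plane candidates can be enumerated as below. The representation of a candidate as a (dimension index, threshold) pair and the name `division_plane_candidates` are assumptions of this sketch, and m = 8 is a hypothetical value.

```python
def division_plane_candidates(mins, maxs, m):
    """Return (dimension index, threshold) pairs, m - 1 per dimension."""
    candidates = []
    for dim, (lo, hi) in enumerate(zip(mins, maxs)):
        width = (hi - lo) / m
        for j in range(1, m):
            # Boundary between section j - 1 and section j of this dimension.
            candidates.append((dim, lo + j * width))
    return candidates


cands = division_plane_candidates([0.0, 0.0], [8.0, 16.0], 8)
print(len(cands))  # 14, i.e. 2 dimensions x (8 - 1) boundaries
```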
The data provisional division unit 44 bisects multi-dimensional data by using the division plane obtained by the division plane candidate creator 43 and generates two clusters (step 3).
The model generator 45 generates models A and B respectively from the two clusters A and B obtained by the data provisional division unit 44 (step 4). In other words, the model generator 45 generates the model A by using input data belonging to the cluster A, and generates the model B by using input data belonging to the cluster B. The models A and B generated respectively from the clusters A and B are shown in
The evaluation value calculator 46 calculates an evaluation value for the above-described division on the basis of the models generated by the model generator 45 and the input data (step 5). Details of this calculation will be described hereafter.
An absolute value of a difference between y estimated from x by using the model and actual y is regarded as an error. As to points in the cluster A, errors from the model A are added up to calculate an error of the model A. As to points in the cluster B, errors from the model B are added up to calculate an error of the model B. The error of the model A and the error of the model B are added up. The result obtained by the addition is divided by the number of all points (the number of data) included in the clusters A and B. A resultant value is used as an evaluation value.
The evaluation value may be calculated as hereafter described. In other words, squares of differences each obtained between an estimated value of y and an actual y value for all points are added up. A resultant sum is divided by the number of all points, and a square root of a result of the division is used as an evaluation value.
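The two evaluation values described above (the mean of absolute errors and the root of the mean of squared errors) can be sketched as follows for two-dimensional data. The sketch assumes that each model is a least-squares line y = a*x + b; the function names and the sample data are hypothetical.

```python
import math


def fit_line(points):
    """Least-squares line y = a*x + b for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return 0.0, mean_y  # all x equal: fall back to a constant model
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def evaluation_values(cluster_a, cluster_b):
    """Return (mean absolute error, root mean square error) over both clusters."""
    abs_sum = sq_sum = 0.0
    for cluster in (cluster_a, cluster_b):
        a, b = fit_line(cluster)          # one model per cluster
        for x, y in cluster:
            diff = y - (a * x + b)        # error of the point from its own model
            abs_sum += abs(diff)
            sq_sum += diff * diff
    n = len(cluster_a) + len(cluster_b)   # number of all points
    return abs_sum / n, math.sqrt(sq_sum / n)


# Hypothetical pump data: constant pressure when stopped, proportional when running.
stopped = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
running = [(4.0, 8.0), (5.0, 10.0), (6.0, 12.0), (7.0, 14.0)]
print(evaluation_values(stopped, running))  # (0.0, 0.0)
```

A division that mixes the two operation states in one cluster yields larger values, which is what allows the candidates to be ranked.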
In the case where the principal component analysis is used in the above-described model generation, if k-dimensional input data is supposed, a plane spanned by the first to the (k−1)th principal components is used as a model, and a distance between the model and a point is regarded as an error. Thereafter, an evaluation value is calculated in the same way as the case where the regression analysis is used.
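For two-dimensional input data (k = 2), the plane spanned by the first to the (k−1)th principal components reduces to a line through the centroid along the first principal component. The following sketch computes that line in closed form and the distance of a point from it; the function names are hypothetical, and the closed-form angle is used here only to avoid a general eigenvalue routine.

```python
import math


def first_principal_axis(points):
    """Return the centroid and the angle of the first principal component (2-D)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Orientation of the major axis of the 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), theta


def distance_to_model(point, centre, theta):
    """Perpendicular distance of a point from the principal-axis line (the error)."""
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return abs(-math.sin(theta) * dx + math.cos(theta) * dy)


centre, theta = first_principal_axis([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
print(distance_to_model((2.0, 0.0), centre, theta))  # about 1.414 (square root of 2)
```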
The steps 3 to 5 heretofore described are conducted on each of the division plane candidates. As a result, an evaluation value is calculated with respect to each of division plane candidates.
The division candidate selector 47 selects a division plane candidate having a highest evaluation value (for example, a minimum evaluation value) from among as many generated evaluation values as the number of the division plane candidates (here, the number is 14) (step 6). If an end condition is satisfied (a continuation condition is not satisfied), however, the division candidate selector 47 outputs an end signal indicating the processing end, without selecting a division plane candidate. The end condition is, for example, that the minimum evaluation value does not become lower than a preset threshold.
The division/decision unit 48 divides input data (point group) by the division plane selected by the division candidate selector 47, and generates two new data sets (step 7). In order to repeat the processing (steps 2 to 7) conducted by the function units 43 to 47 with respect to each of the newly generated data sets, the division/decision unit 48 outputs each data set to the division plane candidate creator 43 (step 8). The division/decision unit 48 determines the end of the repetition processing, for example, as follows.
That is, the division/decision unit 48 sets a flag for each data set when sending data sets to the division plane candidate creator 43. If an end signal is input or division is conducted with respect to a certain data set, the flag for the data set is erased. If all set flags are erased, the processing end is determined. If an end signal is input in a first round of the flow chart shown in
As a result of the processing heretofore described, the input data is bisected recursively and clusters are generated.
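The recursive bisection of steps 2 to 8 can be sketched as follows for two-dimensional data. Several points are assumptions of this sketch rather than part of the embodiment: each model is a least-squares line, a cluster must contain at least two points, and the preset threshold of the end condition is taken as the error before division minus a hypothetical margin `min_gain`.

```python
def fit_line(points):
    """Least-squares line y = a*x + b for a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    if sxx == 0:
        return 0.0, my
    a = sum((p[0] - mx) * (p[1] - my) for p in points) / sxx
    return a, my - a * mx


def total_abs_error(points):
    """Sum of absolute errors of the points from their own fitted model."""
    a, b = fit_line(points)
    return sum(abs(y - (a * x + b)) for x, y in points)


def bisect_recursively(points, candidates, min_gain=0.1):
    """Return the list of clusters obtained by recursive bisection."""
    n = len(points)
    error_before = total_abs_error(points) / n
    best = None
    for dim, thr in candidates:                      # steps 3 to 5 per candidate
        left = [p for p in points if p[dim] < thr]
        right = [p for p in points if p[dim] >= thr]
        if len(left) < 2 or len(right) < 2:
            continue                                 # assumption: 2 points per model
        value = (total_abs_error(left) + total_abs_error(right)) / n
        if best is None or value < best[0]:
            best = (value, left, right)              # step 6: keep the minimum
    # End condition (assumed threshold): the split must lower the error by min_gain.
    if best is None or best[0] > error_before - min_gain:
        return [points]
    _, left, right = best                            # step 7, then step 8 (recursion)
    return (bisect_recursively(left, candidates, min_gain)
            + bisect_recursively(right, candidates, min_gain))


points = ([(float(x), 1.0) for x in range(4)]
          + [(float(x), 2.0 * x) for x in range(4, 8)])
candidates = [(0, float(t)) for t in range(1, 8)]    # lines perpendicular to x
clusters = bisect_recursively(points, candidates)
print(len(clusters))  # 2
```

In this hypothetical data the division line at x = 4 separates the constant-pressure points from the proportional points, and neither resulting cluster is divided further because no candidate lowers its error.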
According to the present embodiment, selecting a division plane that is minimized in error from the model and conducting data division (clustering) by using the selected division plane are repeated recursively as heretofore described. Therefore, multi-dimensional data can be divided into a plurality of clusters while properly reflecting the tendency other than the distance between points, which is immanent in the multi-dimensional data. Thereby, for example, it becomes possible to separate running history data into data differing in running situation, when creating a model for estimating a proper variation range of each sensor in the plant by using values of other sensors.
In the present embodiment, the evaluation value calculation conducted by the evaluation value calculator 46 will be described in more detail.
As described with reference to the first embodiment, data is divided into DAi and DBi (clusters Ai and Bi are generated) by a certain division plane candidate (denoted by φi) created by the division plane candidate creator 43, and models Ai and Bi and errors error_Ai and error_Bi are calculated respectively for DAi and DBi. Here, the error_Ai is the sum total of errors of data belonging to DAi, and the error_Bi is the sum total of errors of data belonging to DBi. The number of data belonging to DAi and the number of data belonging to DBi are denoted by num_Ai and num_Bi, respectively.
Model evaluation values error_adjust_Ai and error_adjust_Bi respectively for DAi and DBi are calculated by using the following equations.
error_adjust_Ai = error_Ai − α × num_Ai + β
error_adjust_Bi = error_Bi − α × num_Bi + β
As for α, for example, a value (error before division/the number of data before division) may be used. Here, β is a parameter for determining the cease of division.
As for an evaluation value error_adjust_i, a value obtained by providing error_adjust_Ai and error_adjust_Bi with respective weights to yield products and adding up resultant products may be used, or either error_adjust_Ai or error_adjust_Bi having a smaller value may be used. If the error_adjust_i is a threshold (for example, zero) or more, the division plane candidate φi is not adopted as a candidate for division.
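A minimal sketch of this evaluation follows. It assumes α is taken as the error before division divided by the number of data before division, and that the smaller of the two adjusted errors is used as error_adjust_i; the function names and the numbers are hypothetical.

```python
def adjusted_error(error_sum, num, alpha, beta):
    """error_adjust = error - alpha * num + beta for one provisional cluster."""
    return error_sum - alpha * num + beta


def candidate_evaluation(error_a, num_a, error_b, num_b,
                         error_before, num_before, beta):
    """Return error_adjust_i, here chosen as the smaller of the two adjusted errors."""
    alpha = error_before / num_before   # average error per data piece before division
    adj_a = adjusted_error(error_a, num_a, alpha, beta)
    adj_b = adjusted_error(error_b, num_b, alpha, beta)
    return min(adj_a, adj_b)


value = candidate_evaluation(error_a=0.0, num_a=4, error_b=2.0, num_b=4,
                             error_before=12.0, num_before=8, beta=1.0)
print(value)        # -5.0
print(value < 0.0)  # True: below the threshold of zero, so the candidate is kept
```

A larger β makes it harder for any candidate to fall below the threshold, so division ceases earlier.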
Points shown in
Here, it is considered to be desirable to have a smaller model error. If the values are approximately the same, it is considered to be desirable to have a larger number of data included in the cluster. According to this criterion, points in the graph are desired to be located at the bottom right-hand corner as far as possible. In order to clarify the reference for selecting a best point, a reference line that passes through the origin and that has an inclination of α as shown in
Here, a line obtained by moving the reference line having the inclination α in the minus direction of the ordinate by β is referred to as threshold line. If the maximum evaluation line is on or under the threshold line, data division is conducted by using the division plane candidate having the maximum evaluation point. On the other hand, if the maximum evaluation line is above the threshold line, data division is stopped. In other words, the division candidate selector 47 outputs an end signal.
According to the present embodiment, the evaluation value is calculated by using errors before the division and the parameter for determining the cease of the division, as heretofore described. Therefore, a division plane candidate can be properly selected.
In the present embodiment, processing for merging clusters generated according to the first embodiment is added. Hereafter, the present embodiment will be described in detail.
Elements 41 to 48 are the same as those shown in
The merging candidate generator 51 generates cluster pairs by using all combinations from the clusters A, B, C and D. As a result, pairs (A, B), (A, C), (A, D), (B, C), (B, D) and (C, D) (merging candidates) are generated.
The merging candidate selector 52 selects the generated pairs one after another and outputs them to the model generator 53.
The model generator 53 conducts model generation on a set of points in each of input pairs.
The merging evaluation value calculator 54 calculates a merging evaluation value for each of the generated models. The merging evaluation value is calculated according to a function using, for example, model errors, the number of data and the number of models. In the case of the pair (A, B), the calculation is conducted as described below. It is now supposed that the errors of the models A and B are respectively error_A and error_B and the numbers of data are respectively num_A and num_B. Furthermore, it is supposed that an error of a model AB obtained when the clusters A and B are merged is error_AB and the number of data is num_AB. The error_A, error_B and error_AB can be calculated in the same way as the first embodiment. And a difference between (num_A/num_AB)*error_A+(num_B/num_AB)*error_B+2*γ and error_AB+1*γ is obtained as the merging evaluation value. Here, γ is a constant given by a user, and each of “2” and “1” represents the number of models (two models before merging and one model after merging).
If the merging evaluation value satisfies a predetermined reference (a merging reference), for example, if the merging evaluation value is equal to or less than a predetermined value, clusters in the pair are merged by the merging/decision unit 55. If a certain cluster belongs to a plurality of pairs that satisfy the predetermined reference, a pair having a lower merging evaluation value is given priority.
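The merging decision can be illustrated by the sketch below. It rests on assumptions that are not fixed by the description: the γ term is weighted by the number of models on each side (two for the separate clusters, one for the merged cluster), the difference is taken as the merged cost minus the separate cost so that lower values favor merging, and the merging reference is a value of zero or less. All names and numbers are hypothetical.

```python
from itertools import combinations


def merging_evaluation_value(error_a, num_a, error_b, num_b, error_ab, gamma):
    """Cost of one merged model minus cost of two separate models."""
    num_ab = num_a + num_b
    separate = ((num_a / num_ab) * error_a
                + (num_b / num_ab) * error_b
                + 2 * gamma)                # two models before merging
    merged = error_ab + 1 * gamma           # one model after merging
    return merged - separate


# All pairs from four clusters, as generated by the merging candidate generator.
print(list(combinations("ABCD", 2)))

value = merging_evaluation_value(error_a=0.2, num_a=40, error_b=0.4, num_b=60,
                                 error_ab=0.5, gamma=0.3)
print(value <= 0.0)  # True: the error increase (0.18) is smaller than gamma (0.3)
```

Under these assumptions, a pair is merged when the error added by sharing one model does not exceed γ.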
In the present example, there are pairs of six ways (A, B), (A, C), (A, D), (B, C), (B, D) and (C, D) as described above. Merging evaluation values for (A, C) and (B, D) satisfy the above-described predetermined reference. As a result, the merging/decision unit 55 merges the clusters A and C to generate a cluster E, and merges the clusters B and D to generate a cluster F. This state is shown in
The merging/decision unit 55 outputs the generated clusters (here E and F) and clusters that have not been merged (not present in the present example) to the merging candidate generator 51. The above-described processing is repeated with respect to these clusters. Since a merging evaluation value calculated from the pair (E, F) does not satisfy the predetermined reference, the merging/decision unit 55 terminates processing without merging the clusters E and F. In other words, the clusters E and F finally remain.
Incidentally, the merging candidate generator 51 may generate cluster pairs only from adjacent clusters. In this case, the number of pairs can be reduced.
According to the present embodiment, clusters are merged as heretofore described. Therefore, it is possible to prevent the number of clusters from increasing unnecessarily.
First, processing is conducted by a data input unit 61 and a data discretization unit 62 in the same way as the first embodiment. In the subsequent processing, either of data before the discretization and data after the discretization may be used as the multi-dimensional data. In the former case, the processing time becomes longer, but the precision becomes higher. Conversely, in the latter case, the processing becomes fast, but the precision becomes lower.
Subsequently, processing is conducted by a division plane candidate creator 63. Then, a data provisional division unit 64 divides input data into two clusters A and B using a certain division line candidate l. Subsequently, a model generator 65 generates models A and B respectively from clusters A and B. This state is shown in
Here, a grouping unit 66 regroups points (input data) according to the distance from the models. It is supposed that points located near the model A belong to the cluster A and that points located near the model B belong to the cluster B. This state is shown in
An evaluation value calculator 67 calculates an evaluation value on the basis of the clusters A and B and the models A and B after the re-grouping in the same way as the first or second embodiment, and outputs the calculated evaluation value to a division candidate selector 68.
Upon receiving evaluation values for all division line candidates, the division candidate selector 68 outputs, to a decision unit 69, the grouping result corresponding to the division line candidate having the best evaluation value among them, together with the best evaluation value. If the best evaluation value satisfies a reference value determined by the user, the decision unit 69 terminates the processing. If the best evaluation value does not satisfy the reference value, the decision unit 69 passes each group to the division plane candidate creator 63. In the foregoing description, the processing conducted by the model generator 65, the grouping unit 66 and the evaluation value calculator 67 may also be repeated. In other words, the model generator 65 and the grouping unit 66 conduct the model generation and the grouping again, and the evaluation value calculator 67 calculates the evaluation value. The processing may be repeated until the evaluation value is no longer improved, i.e., until the variation in the evaluation value becomes a certain value or less, or the processing may be repeated a certain number of times.
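The re-grouping by distance from the models, and its repetition, can be sketched as follows. The sketch assumes least-squares line models and uses the vertical residual as the distance from a model; it stops when the grouping no longer changes or after a hypothetical `max_rounds`, which stands in for the stop criterion on the evaluation value. All names and data are hypothetical.

```python
def fit_line(points):
    """Least-squares line y = a*x + b for a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    if sxx == 0:
        return 0.0, my
    a = sum((p[0] - mx) * (p[1] - my) for p in points) / sxx
    return a, my - a * mx


def residual(point, model):
    """Vertical distance of a point from a line model (assumed distance measure)."""
    a, b = model
    return abs(point[1] - (a * point[0] + b))


def regroup(points, model_a, model_b):
    """Assign every point to the cluster whose model is nearer."""
    group_a = [p for p in points if residual(p, model_a) <= residual(p, model_b)]
    group_b = [p for p in points if residual(p, model_a) > residual(p, model_b)]
    return group_a, group_b


def refine(cluster_a, cluster_b, max_rounds=10):
    """Alternate model generation and grouping until the grouping is stable."""
    for _ in range(max_rounds):
        model_a, model_b = fit_line(cluster_a), fit_line(cluster_b)
        new_a, new_b = regroup(cluster_a + cluster_b, model_a, model_b)
        if not new_a or not new_b:
            break  # assumption: keep the previous grouping if a cluster empties
        unchanged = sorted(new_a) == sorted(cluster_a)
        cluster_a, cluster_b = new_a, new_b
        if unchanged:
            break
    return cluster_a, cluster_b


stopped = [(float(x), 1.0) for x in range(5)]
running = [(float(x), 2.0 * x) for x in range(3, 8)]
points = stopped + running
# Provisional division by the line x = 5 leaves two running points in cluster A.
a0 = [p for p in points if p[0] < 5]
b0 = [p for p in points if p[0] >= 5]
a1, b1 = refine(a0, b0)
print(sorted(a1) == stopped, sorted(b1) == running)  # True True
```

In this hypothetical data the two operation states overlap along x, so no single division line separates them, whereas the re-grouping does.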
In a fifth embodiment, the division line (division plane) selected by the division candidate selector 47 shown in
It is now supposed that a division line l is selected for certain input data by the division candidate selector 47. It is supposed that division lines l− and l+ are adjacent to the division line l. The division candidate selector 47 creates new division line candidates between the lines l− and l+. As for the way of creating new division line candidates, the interval between l− and l+ may be simply divided into equal parts, or points included between l− and l+ may be separated. An example of drawing lines in a way that separates six points included between l− and l+ is shown in
According to the present embodiment, the division line l is offset in the ranges of adjacent division lines as heretofore described. Therefore, it becomes possible to conduct data division independently of intervals of division lines.
In the present embodiment, data division (clustering) is conducted while changing the combination of dimensions to be used. Hereafter, the present embodiment will be described in detail.
An example of four-dimensional input data is shown in
First, two dimensions are selected from among the explanatory dimensions, and a three-dimensional series including the two selected dimensions and the target dimension is supposed. In general, when the number of dimensions to be used is k, k−1 dimensions are selected from the explanatory dimensions. If a series formed of the x and z dimensions and the y dimension is selected, a series shown in
By the way, the explanatory dimension may be divided into two dimensions, i.e., a fixed explanatory dimension and an additional explanatory dimension. The fixed explanatory dimension is a dimension that is necessarily used, and the additional explanatory dimension is a dimension selected during the processing. For example, supposing the y dimension to be the target dimension, the x dimension to be the fixed explanatory dimension, and the z and w dimensions to be additional explanatory dimensions, the above-described processing is conducted for the combination of the x, y and z dimensions and the combination of the x, y and w dimensions.
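The enumeration of dimension combinations can be sketched with the standard library as follows. The function name `dimension_sets` and its arguments are hypothetical; the sketch only lists the combinations and leaves the data division for each combination to the processing already described.

```python
from itertools import combinations


def dimension_sets(target, fixed, additional, k):
    """Yield the dimension tuples to try when k dimensions are used in total."""
    free = k - 1 - len(fixed)  # explanatory slots left after the fixed dimensions
    for extra in combinations(additional, free):
        yield tuple(fixed) + (target,) + extra


# Target y, fixed explanatory x, additional explanatory z and w, k = 3.
print(list(dimension_sets("y", ["x"], ["z", "w"], 3)))
# [('x', 'y', 'z'), ('x', 'y', 'w')]

# No fixed dimension: every pair of explanatory dimensions is tried.
print(list(dimension_sets("y", [], ["x", "z", "w"], 3)))
# [('y', 'x', 'z'), ('y', 'x', 'w'), ('y', 'z', 'w')]
```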
According to the present embodiment, as heretofore described, it becomes possible to conduct data division capable of generating a high-precision model in the case where the number of explanatory dimensions used for the data division is limited to a small number.
In the present embodiment, evaluation value calculation conducted by the evaluation value calculator 46 is improved on the basis of the first embodiment. A detailed configuration of an evaluation value calculator 71 in the present embodiment is shown in
In the present embodiment, dimensions used in the data division and division plane evaluation may be all or a part of the input data dimensions. Furthermore, dimensions used in the data division may be the same as or different from dimensions used in the division plane evaluation.
It is supposed that the input data is four-dimensional, three dimensions x, y and z are used for the data division, and four dimensions x, y, z and w are used for the division plane evaluation. Here, one certain dimension is referred to as target dimension and previously given. It is now supposed that the y dimension is the target dimension. Remaining dimensions are referred to as explanatory dimensions.
First, x, y, and z dimension data are processed in the data discretization unit 42, the division plane candidate creator 43, the data provisional division unit 44 and the model generator 45 according to the first embodiment.
The class number providing unit 73 in the evaluation value calculator 71 assigns a number to each cluster. This is referred to as class number. An example of generated clusters is shown in
The decision tree generator (classification rule generator) 74 in the evaluation value calculator 71 generates a decision tree (classification rule) having dimensions other than the target dimension y among the input data dimensions (i.e., the explanatory dimensions) as its attribute and having a class number as its class. An example of a decision tree generated from the data shown in
The expanded evaluation value calculator 75 in the evaluation value calculator 71 calculates an evaluation value e for each of division plane candidates in the same way as the first embodiment, and calculates values, such as a precision p of a decision tree corresponding to each of the division plane candidates and a depth d (having a depth 1 in the case of
The division candidate selector 47 (see
According to the present embodiment, the evaluation value is calculated taking elements such as the precision of the classification rule and the depth into consideration as heretofore described. Therefore, a division plane candidate can be selected properly.
In the present embodiment, the processing in any of the above-described embodiments is conducted on a plurality of combinations of dimensions, and a model is generated for each combination of dimensions. The models corresponding to the combinations of dimensions are evaluated, and the data division corresponding to the model having the highest evaluation is adopted. Hereafter, the present embodiment will be described in detail.
An element 81 denotes a plurality of data division apparatuses A, B, C . . . , each of which is a data division apparatus according to any of the first to seventh embodiments. For example, the data division apparatuses A, B, C . . . may all be data division apparatuses according to the first embodiment, or may all be data division apparatuses according to the second embodiment. However, none of the data division apparatuses A, B, C . . . includes its own data input unit. Instead, in the present embodiment, a data input unit 82 common to the data division apparatuses A, B, C . . . is provided.
It is supposed that the input data supplied from the data input unit 82 to the data division apparatuses A, B, C . . . is the same, and that the target dimension is the same in the data division apparatuses A, B, C . . . . However, the dimensions used in the data division may differ for each of the data division apparatuses. For example, supposing the target dimension to be y, the data division apparatuses A, B, C . . . respectively utilize (x, y, z), (x, y, w), (z, y, w) for the data division. The number of dimensions used may also be different; in that case, for example, (x, y), (y, z), (y, w) are utilized. As a result of processing, the data division apparatuses A, B, C . . . output models A, B, C . . . and data division candidates A, B, C . . . , respectively. The data division candidate A includes a plurality of clusters obtained as a result of data division, and the model A is a set of models corresponding to the respective clusters. The same applies to the data division candidates B, C . . . and the models B, C . . . .
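As a minimal sketch of how such combinations of dimensions could be enumerated, the following assumes that every combination must contain the target dimension and that the remaining members are chosen from the explanatory dimensions. The function name `dimension_subsets` is hypothetical and is not part of the described apparatus.

```python
from itertools import combinations

def dimension_subsets(dimensions, target, size):
    """Return every combination of `size` dimensions that contains `target`."""
    explanatory = [d for d in dimensions if d != target]
    subsets = []
    # Pick size-1 explanatory dimensions and always add the target dimension.
    for combo in combinations(explanatory, size - 1):
        # Keep the original ordering of the input dimensions for readability.
        subsets.append(tuple(sorted(combo + (target,), key=dimensions.index)))
    return subsets

dims = ["x", "y", "z", "w"]
three_dim = dimension_subsets(dims, target="y", size=3)
two_dim = dimension_subsets(dims, target="y", size=2)
```

Each resulting tuple would be handed to one data division apparatus A, B, C . . . ; the order of dimensions inside a tuple carries no meaning in this sketch.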
A class number providing unit (class number assigning unit) 83 provides each of the clusters included in the data division candidates A, B, C . . . with a class number. In this way, the class number providing unit 83 provides each data item included in the input data with a class number.
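The class number assignment can be pictured with the short sketch below. It assumes that a data division candidate is represented as a list of clusters, each cluster being a list of data points, and that class numbers start at 1 in cluster order; this representation and the name `assign_class_numbers` are illustrative assumptions only.

```python
def assign_class_numbers(clusters):
    """Pair every data point with the class number of the cluster it belongs to.

    `clusters` is a list of clusters; each cluster is a list of points (tuples).
    Class numbers start at 1 and follow the order of the clusters (an assumption).
    """
    labelled = []
    for class_number, cluster in enumerate(clusters, start=1):
        for point in cluster:
            labelled.append((point, class_number))
    return labelled

# Made-up candidate with two clusters.
candidate = [[(1.0, 2.0)], [(3.0, 4.0), (5.0, 6.0)]]
labelled = assign_class_numbers(candidate)
```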
A decision tree generator 84 generates decision trees A, B, C . . . , one for each of the data division candidates A, B, C . . . , each having the dimensions other than the target dimension (i.e., the explanatory dimensions) as its attributes and having the class number as its class. The data used to generate the decision trees may be the same as the data used for the data division, or may be different from it. In the latter case, the data is supplied from a data input unit 87 to the decision tree generator 84.
An expanded evaluation value calculator 85 calculates, for each of the models A, B, C . . . , an expanded evaluation value based on the values e, p and d described in the seventh embodiment, by using the corresponding decision tree A, B, C . . . .
A best data-division selector 86 selects the data division candidate having the highest evaluation according to the expanded evaluation values.
According to the present embodiment, a data division candidate capable of generating a high precision model can be determined.
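The selection step can be sketched as follows. The way e, p and d are combined is not reproduced here, so the combining function is passed in by the caller; `illustrative_value` is a purely hypothetical formula (it assumes that larger e and p are better and that a shallower tree is better), and the numbers in `candidates` are made-up sample values, not results.

```python
def select_best(candidates, expanded_value):
    """Return the best candidate name and all expanded evaluation values.

    `candidates` maps a candidate name to a dict with keys "e", "p" and "d".
    `expanded_value` is a caller-supplied function of (e, p, d); higher is better.
    """
    scores = {
        name: expanded_value(v["e"], v["p"], v["d"])
        for name, v in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

def illustrative_value(e, p, d):
    # Hypothetical combination, NOT taken from the description:
    # reward a high evaluation value and precision, penalize tree depth.
    return e * p / (1 + d)

# Made-up sample values for three data division candidates.
candidates = {
    "A": {"e": 0.80, "p": 0.95, "d": 1},
    "B": {"e": 0.90, "p": 0.70, "d": 3},
    "C": {"e": 0.85, "p": 0.90, "d": 2},
}
best, scores = select_best(candidates, illustrative_value)
```

Passing the combining function as a parameter keeps the selection logic independent of whichever expanded evaluation value is actually used.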
Number | Date | Country | Kind |
---|---|---|---|
2005-152324 | May 2005 | JP | national |

Number | Name | Date | Kind |
---|---|---|---|
5335291 | Kramer et al. | Aug 1994 | A |
5444796 | Ornstein | Aug 1995 | A |
5590218 | Ornstein | Dec 1996 | A |
6581058 | Fayyad et al. | Jun 2003 | B1 |

Number | Date | Country |
---|---|---|
20060269144 A1 | Nov 2006 | US |