This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2005-176700 filed on Jun. 16, 2005, the entire contents of which are incorporated herein by reference.
1. Field of the Invention
The present invention relates to a clustering apparatus, a clustering method, and a program.
2. Description of the Background
The need for data analysis of numerical information, such as sensor data at factories, in order to conduct output prediction or abnormality detection is increasing. Observed numerical data are produced by an underlying mechanism. If the mechanism is sufficiently elucidated, it is possible to construct a strict mathematical model and obtain predicted values from the mathematical model.
In general, however, as a system becomes complicated, it becomes difficult to construct, by numerical equations, a high-precision model which makes strict calculations possible.
Therefore, a model is constructed from observed data by using an analysis technique such as data mining. When plural sensor outputs are obtained, the observed data are multi-dimensional data including plural variables. For constructing a model from observed data, it is indispensable to know the correlation among variables. In the case where the correlation among variables is complicated, the data are frequently divided into several sets.
For example, suppose that there is a scatter diagram of two variables, and that this scatter diagram broadly includes two kinds of data groups, i.e., data existing in close vicinity to a certain straight line L1 and data existing in close vicinity to another straight line L2. In this case, it is suitable to divide the data into two kinds of data groups and conduct analysis on each.
If it is not known in advance that the data fall along the two straight lines, then it is necessary to conduct processing for automatically dividing the data into plural data groups, i.e., clustering processing.
In the conventional clustering technique, however, a desired clustering result, i.e., a clustering result close to human intuition, cannot be obtained in some cases. For example, a data group in close vicinity to a certain straight line is often divided into separate clusters.
According to an aspect of the present invention, there is provided a clustering apparatus comprising: an initial cluster generator configured to divide multi-dimensional data to generate a plurality of clusters each including one or more data pieces; a cluster recorder configured to record the clusters generated; a cluster selector configured to calculate parameters of a previously given model which is common to the clusters, from each of the clusters, and select clusters to be unified on the basis of the parameters calculated from each cluster; a cluster unifier configured to unify clusters selected by the cluster selector to generate a new cluster; and a cluster evaluator configured to calculate an evaluation value for evaluating a set of the clusters except the unified clusters and the new cluster.
According to an aspect of the present invention, there is provided a clustering method comprising: dividing multi-dimensional data to generate a plurality of clusters each including one or more data pieces; recording the clusters generated; calculating parameters of a previously given model which is common to the clusters, from each of the clusters; selecting clusters to be unified on the basis of the parameters calculated from each cluster; unifying clusters selected to generate a new cluster; calculating an evaluation value for evaluating a set of the clusters except the unified clusters and the new cluster; and returning to the selecting in a case where the evaluation value does not satisfy a threshold value.
According to an aspect of the present invention, there is provided a computer program, comprising instructions for: dividing multi-dimensional data to generate a plurality of clusters each including one or more data pieces; recording the clusters generated; calculating parameters of a previously given model which is common to the clusters, from each of the clusters; selecting clusters to be unified on the basis of the parameters calculated from each cluster; unifying clusters selected to generate a new cluster; calculating an evaluation value for evaluating a set of the clusters except the unified clusters and the new cluster; and returning to the selecting in a case where the evaluation value does not satisfy a threshold value.
The clustering apparatus shown in
The database 12 stores multi-dimensional data having a sequence length n. An example of two-dimensional data having a sequence length of 9 is shown in
The initial cluster generator 11 generates initial clusters from multi-dimensional data stored in the database 12 (S1). The initial clusters are generated by, for example, dividing the multi-dimensional data like mesh
Nine data included in the multi-dimensional data shown in
The initial cluster generator 11 records the generated clusters C1, C2 and C3 in the cluster recorder 14.
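The mesh-style generation of initial clusters (S1) can be sketched as follows. This is only an illustration under stated assumptions: the function name `mesh_clusters`, the square cell width `cell_size`, and the nine sample points are hypothetical and are not taken from the specification.

```python
import math
from collections import defaultdict


def mesh_clusters(points, cell_size):
    """Group 2-D points by the square mesh cell they fall into.

    cell_size is a hypothetical mesh width; the specification does not
    state how the mesh is dimensioned.
    """
    cells = defaultdict(list)
    for x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((x, y))
    # Each non-empty cell becomes one initial cluster.
    return list(cells.values())


# Hypothetical nine-piece data set (not the data of the original figure).
data = [(1, 1), (2, 2), (3, 3), (5, 5), (6, 6), (7, 7), (5, 2), (6, 2), (7, 2)]
initial = mesh_clusters(data, cell_size=4)
print(len(initial), [len(c) for c in initial])
```

With this hypothetical data and mesh width, each of the three occupied cells yields one initial cluster of three data pieces.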
The cluster selector 15 selects clusters to be unified, from a cluster set recorded in the cluster recorder 14. Specifically, the cluster selector 15 calculates parameters of a previously given model which is common to the clusters, from each of the clusters (S2), and selects clusters to be unified, on the basis of the calculated parameters of respective clusters (S3). Hereafter, an example in which clusters C1, C2 and C3 are used as the cluster set and a straight line y=ax+b is used as the previously given model will be described.
Parameters of a straight line model are a gradient “a” and an intercept “b.” A data set belonging to a cluster Ci (i=1, 2, 3) is described as Di. Model parameters of the straight line calculated from data of Di are denoted as (ai, bi). If |Di|≧2, the parameters of the straight line can be calculated as follows:
An error Ei of a cluster is calculated according to the following equation using the parameters found by the equation (1).
The error of the cluster means a deviation between the model and the actual data.
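As a rough illustration of the parameter and error calculation, the sketch below fits y=ax+b by ordinary least squares and measures the deviation of the data from the fitted line. The exact forms of the equations (1) and (2) are not reproduced in this text, so two things are assumptions here: that the fit is the standard least-squares fit, and that the error is the mean squared vertical deviation. The function names and the sample data are hypothetical.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two data pieces are required")
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are equal; the gradient is undefined")
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b


def cluster_error(points, a, b):
    """Assumed error form: mean squared vertical deviation from the line."""
    return sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)


# Hypothetical data lying exactly on y = x.
d1 = [(1, 1), (2, 2), (3, 3)]
a1, b1 = fit_line(d1)
print(a1, b1, cluster_error(d1, a1, b1))
```

A cluster whose data lie exactly on one line has zero error under this definition; scattered data give a positive error.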
Parameters of the clusters C1, C2 and C3 are found according to the equation (1) as C1:(a1, b1)=(1, 0), C2:(a2, b2)=(1, 0) and C3:(a3, b3)=(0, 2). Straight lines having respective parameters are drawn on the coordinate system in
Handling “ai” representing a gradient of a straight line and “bi” representing a y-intercept with the same weight, a distance D between two clusters C1:(a1, b1) and C2:(a2, b2) is calculated as follows:
Or laying weight on the gradients of the two clusters, the distance D may be calculated as follows:
Here, A is a positive constant greater than unity.
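The two distance definitions can be sketched as below. Since the equations (3) and (4) are not reproduced in this text, the sketch assumes a Euclidean distance in the (a, b) parameter space, with the squared gradient difference multiplied by A in the weighted variant. The function name `param_distance` and the value A=4 are hypothetical; the cluster parameters are those stated above for C1, C2 and C3.

```python
import math


def param_distance(p, q, gradient_weight=1.0):
    """Distance between two line models p = (a1, b1) and q = (a2, b2).

    gradient_weight plays the role of the constant A; 1.0 treats the
    gradient and the intercept with the same weight (assumed form).
    """
    (a1, b1), (a2, b2) = p, q
    return math.sqrt(gradient_weight * (a1 - a2) ** 2 + (b1 - b2) ** 2)


c1, c2, c3 = (1, 0), (1, 0), (0, 2)
print(param_distance(c1, c2))                       # identical models
print(param_distance(c1, c3))                       # equal weights
print(param_distance(c1, c3, gradient_weight=4.0))  # gradient emphasized
```

Increasing the weight enlarges the distance between clusters whose gradients differ, so clusters with similar gradients are preferred for unification.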
The case where the multi-dimensional data are two-dimensional has been described heretofore. Alternatively, multi-dimensional data having a higher dimension may also be used.
In general, when data are plotted on an n-dimensional space, a hyperplane can be represented by using (n+1) coefficients ai (i=0, 1, . . . n) (here, n coefficients among them are independent) as follows:
If there are N pieces of data in n-dimensional data as shown in
From the condition in the brackets in the equation (5), a0 can be determined. Eventually, all of ai (i=0, 1, . . . n) can be determined.
A cluster error can be calculated as follows:
In the n-dimensional space, a distance between clusters can be defined using (n+1) coefficients ai (i=0, 1, . . . n). For example, the distance between the two clusters C1: si (i=0, 1, . . . n) and C2: ti (i=0, 1, . . . n) can be defined as follows:
Referring back to
The cluster evaluator 13 calculates an evaluation value for evaluating a cluster set (a set of the clusters C12 and C3) in the cluster recorder 14, and determines whether the evaluation value has reached a threshold value (S5).
For example, a decision is made according to whether the number of clusters in the cluster set has reached a predetermined number K.
If the cluster evaluator 13 judges the evaluation value not to have reached the threshold value (NO at S5), then the processing returns to the step S2 or S3. If the evaluation value has reached the threshold value (YES at S5), then the processing is finished.
Instead of judging whether the number of clusters has reached a predetermined number K, the following method may be taken. That is to say, the processing is finished when a reference value (such as 2k+(E1+E2+ . . . +Ek)/K) calculated using the number k of clusters and errors Ei of respective clusters (where the error and the model parameters of the unified cluster are calculated separately) has changed from a fall to a rise at a timing of the cluster unification.
First, the initial cluster generator 11 generates initial clusters by using the database 12, and records the generated initial clusters into the cluster recorder 14 (S11). Furthermore, the initial cluster generator 11 substitutes a sufficiently large value into an evaluation parameter X as its initial value (S12).
The cluster selector 15 deletes clusters which contain one or fewer data pieces from the cluster set in the cluster recorder 14, and substitutes the total number of clusters after deletion into K (S13).
The cluster selector 15 calculates model parameters from each of clusters by using data belonging to each cluster according to the equation (1). At the same time, the cluster selector 15 calculates the cluster error of each of the clusters according to the equation (2) (S14).
The cluster selector 15 calculates a distance between two clusters for all pairs of two clusters according to the equation (3), and selects, for example, a pair of two clusters having a shortest distance (S15).
The cluster unifier 16 unifies the selected two clusters into one cluster (S16). The cluster unifier 16 or the cluster selector 15 calculates a model parameter according to the equation (1) and an error according to the equation (2) on the unified cluster, and subtracts 1 from the total number K of clusters (S16).
The cluster evaluator 13 calculates an evaluation value X1 by using, for example, the relation X1=2K+(E1+ . . . +EK)/K (S17), and compares the evaluation value X1 with the evaluation parameter X (S18). If the evaluation value X1 is equal to or less than the evaluation parameter X (NO at S18), then the cluster evaluator 13 substitutes X1 into X (S19), and returns to the step S15. On the other hand, if the evaluation value X1 is greater than the evaluation parameter X (YES at S18), then the cluster unified immediately before is restored to the two original clusters (S20) and the processing is finished.
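The flow of the steps S12 to S20 can be sketched as follows. This is a simplified illustration, not the claimed implementation. It assumes a least-squares line fit for the equation (1), a mean squared deviation for the equation (2), and a Euclidean distance between (a, b) pairs for the equation (3). It recomputes the parameters of every cluster on each pass rather than only those of the unified cluster, and it does not handle a cluster whose x values are all equal. The names and the sample data are hypothetical.

```python
import math
from itertools import combinations


def fit_line(points):
    """Least-squares (a, b) of y = a*x + b; needs two or more distinct x values."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n


def cluster_error(points):
    """Assumed error: mean squared deviation from the fitted line."""
    a, b = fit_line(points)
    return sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)


def evaluation_value(clusters):
    """X1 = 2K + (E1 + ... + EK) / K."""
    k = len(clusters)
    return 2 * k + sum(cluster_error(c) for c in clusters) / k


def unify(initial_clusters):
    clusters = [list(c) for c in initial_clusters if len(c) >= 2]  # S13
    x = math.inf                                                   # S12
    while len(clusters) > 1:
        params = [fit_line(c) for c in clusters]                   # S14
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: math.dist(params[p[0]], params[p[1]]))  # S15
        merged = clusters[i] + clusters[j]                         # S16
        candidate = [c for n, c in enumerate(clusters) if n not in (i, j)]
        candidate.append(merged)
        x1 = evaluation_value(candidate)                           # S17
        if x1 > x:                                                 # S18: YES
            break                                                  # S20: keep the clusters before this unification
        x, clusters = x1, candidate                                # S19
    return clusters


# Hypothetical initial clusters: two on y = x, one on y = 2.
c1 = [(1, 1), (2, 2), (3, 3)]
c2 = [(5, 5), (6, 6), (7, 7)]
c3 = [(5, 2), (6, 2), (7, 2)]
print([len(c) for c in unify([c1, c2, c3])])
```

With this hypothetical data, the first unification joins the two clusters on y=x and gives X1=4. Unifying the remainder would raise the evaluation value, so the loop stops with two clusters.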
Effects obtained by the present embodiment will be described as compared with the conventional case.
Clustering is conducted on the initial clusters shown in
In the case where clusters are unified on the basis of distances between cluster-centers according to a conventional method, calculation of gravity points of the clusters C1, C2 and C3 provides C1:(2, 2), C2:(6, 6) and C3:(6, 2) on the basis of two-dimensional data shown in
On the other hand, if y=ax+b is adopted in the present embodiment as the model as described above, then the combination of the clusters C1 and C2 is selected as a unification candidate and the clusters C1 and C2 are unified. Therefore, in the present embodiment, clustering (data division) close to human intuition becomes possible.
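The difference between the two selection rules can be checked numerically using only the gravity points and the straight-line parameters stated above. The sketch assumes a Euclidean distance in both feature spaces; the helper name `closest_pair` is hypothetical.

```python
import math
from itertools import combinations

# Values stated in the description for the clusters C1, C2 and C3.
gravity_points = {"C1": (2, 2), "C2": (6, 6), "C3": (6, 2)}
line_params = {"C1": (1, 0), "C2": (1, 0), "C3": (0, 2)}


def closest_pair(features):
    """Return the pair of cluster names with the smallest Euclidean distance."""
    return min(combinations(sorted(features), 2),
               key=lambda pair: math.dist(features[pair[0]], features[pair[1]]))


print(closest_pair(gravity_points))  # a pair that includes C3
print(closest_pair(line_params))     # the pair C1, C2
```

Measured between gravity points, C3 is at distance 4 from both C1 and C2, whereas C1 and C2 are about 5.66 apart, so a gravity-point rule pairs C3 with one of them. Measured between model parameters, C1 and C2 coincide and are selected.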
The case where the initial clusters C1, C2 and C3 are made as shown in
In more detail, a straight line (y=ax+b) is found from data contained in an initial cluster by using the least squares method, and a deviation of the actual data from the straight line, i.e., an error, is calculated. An initial cluster whose error is equal to or greater than a specified value is divided into pieces (i.e., plural clusters). For example, the initial cluster is divided using planes (or straight lines) disposed at predetermined intervals so as to be perpendicular to the abscissa axis and planes (or straight lines) disposed at predetermined intervals so as to be perpendicular to the ordinate axis. This processing is conducted by, for example, the initial cluster generator 11.
In the case of
In the present embodiment, the case where a segment is used as a model will be described.
Here, as the method for obtaining a segment on the basis of data belonging to a cluster (for example, an initial cluster), either of the following may be used: a method of selecting two data pieces from the cluster and using them as the two end points of a segment, or a method of finding a straight line on the basis of the data belonging to the cluster by using the least squares method and cutting out the straight line portion contained in the cluster. Alternatively, a method may be used in which a vector parallel to the segment is found from the axis of the first principal component obtained by principal component analysis, a straight line parallel to the vector and passing through the gravity point of the data is calculated, and then the straight line portion contained in the cluster is cut out.
The model parameters of the segment are directly represented as coordinates of both end points of the segment. In determining whether to unify two clusters, three parameters, i.e., a segment length ratio I between two segments, an angle θ formed by the segments, and a distance d between gravity points of the segments (gravity point distance) are used as evaluation indexes.
It is supposed that the two segments are a segment x1x2 and a segment y1y2. The end points of the segment x1x2 have coordinates x1=(x11, x12, . . . x1n) and x2=(x21, x22, . . . x2n). The end points of the segment y1y2 have coordinates y1=(y11, y12, . . . y1n) and y2=(y21, y22, . . . y2n). A center coordinate of the segment may be selected as the gravity point of the segment, or a gravity point of data belonging to a segment region (described later) of the segment may be selected as the gravity point of the segment. If the center coordinate of the segment is used as the gravity point of the segment, the gravity point distance d is given by
A cosine of an angle formed by the two segments is given by
The segment length ratio I is given by
In the present embodiment, the distance between clusters is judged using the distance index (I, d, cos θ). For example, if the distance index between the cluster C1 and the cluster C2 is (I1, d1, cos θ1), then closeness between clusters is calculated by using
by giving weights to all elements in the distance index (I1, d1, cos θ1). Here, A1, A2 and A3 are suitable positive constants.
Or the distance between clusters may be defined as
using the distance d and angle θ in order to collect parallel segments in the neighborhood.
A pair of clusters in which the value obtained by using the equation (12) or the equation (13) is minimized is selected, and the selected clusters are unified.
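The three evaluation indexes for a pair of segments can be sketched as follows. The gravity point distance uses the segment centers, and the cosine is the normalized inner product of the two direction vectors. The direction of the length ratio (first segment over second) is an assumption, since the defining equation is not reproduced in this text; the weighted combinations of the equations (12) and (13) are likewise not reproduced and are not implemented. The function names are hypothetical.

```python
import math


def midpoint(p, q):
    """Center coordinate of the segment pq."""
    return tuple((pi + qi) / 2 for pi, qi in zip(p, q))


def segment_indexes(x1, x2, y1, y2):
    """Return (I, d, cos_theta) for the segments x1x2 and y1y2 in n dimensions."""
    u = [b - a for a, b in zip(x1, x2)]   # direction of segment x1x2
    v = [b - a for a, b in zip(y1, y2)]   # direction of segment y1y2
    len_u = math.hypot(*u)
    len_v = math.hypot(*v)
    ratio = len_u / len_v                 # assumed orientation of the ratio I
    d = math.dist(midpoint(x1, x2), midpoint(y1, y2))
    cos_theta = sum(a * b for a, b in zip(u, v)) / (len_u * len_v)
    return ratio, d, cos_theta


# Two parallel segments of different lengths (hypothetical coordinates).
print(segment_indexes((0, 0), (2, 0), (0, 1), (4, 1)))
```

Parallel segments give a cosine of 1 regardless of their lengths, which is why the angle index helps collect parallel segments that lie close together.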
Here, the clusters may be unified as hereafter described.
First, re-clustering is conducted by using segments obtained from each cluster. In other words, data belonging to a segment region which is a definite distance r or less from the segment is regarded as a cluster (segment cluster). An example of a segment region formed by a segment AB is shown in
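A membership test for a segment region can be sketched as below. It assumes that the distance from a data piece to a segment is measured to the nearest point on the segment, including its end points; the exact shape of the region in the original figure is not reproduced in this text. The function names and sample values are hypothetical.

```python
import math


def distance_to_segment(p, a, b):
    """Distance from point p to the nearest point of the segment ab."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    ab_sq = sum(c * c for c in ab)
    # Projection parameter of p onto the line through a and b, clamped to [0, 1].
    t = 0.0 if ab_sq == 0 else sum(x * y for x, y in zip(ap, ab)) / ab_sq
    t = max(0.0, min(1.0, t))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def segment_cluster(points, a, b, r):
    """Data pieces lying within the distance r of the segment ab."""
    return [p for p in points if distance_to_segment(p, a, b) <= r]


points = [(2, 0.5), (2, 2), (5, 0), (6, 0)]
print(segment_cluster(points, a=(0, 0), b=(4, 0), r=1))
```

Under this assumption the region is a band of half-width r around the segment with rounded ends, so a data piece slightly beyond an end point can still belong to the segment cluster.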
If subject data is two-dimensional data, then an n-th order polynomial equation
$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \tag{14}$$
may be used as a model instead of a straight line.
For example, if a model is formed using a quadratic polynomial, the distance between clusters can be calculated using three parameters $(a_0, a_1, a_2)$ in $y = a_0 + a_1 x + a_2 x^2$. Supposing that there are N sets of data (x1, y1), (x2, y2), . . . , (xN, yN) in a cluster, respective parameters can be found as follows:
Denoting parameters of the cluster 1 by (a01, a11, a21) and parameters of the cluster 2 by (a02, a12, a22), the distance D between the clusters can be calculated, for example, as follows:
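For the quadratic model, the parameter calculation and the cluster distance can be sketched as below. The sketch assumes that the parameters are the least-squares solution of the normal equations and that the distance is the Euclidean distance between the two parameter triples; the original equations are not reproduced in this text. The function names and sample data are hypothetical.

```python
import math


def fit_quadratic(points):
    """Least-squares (a0, a1, a2) of y = a0 + a1*x + a2*x**2."""
    # Power sums for the normal equations: s[k] = sum of x**k, t[k] = sum of y*x**k.
    s = [sum(x ** k for x, _ in points) for k in range(5)]
    t = [sum(y * x ** k for x, y in points) for k in range(3)]
    # Augmented 3x4 matrix, solved by Gauss-Jordan elimination with pivoting.
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("at least three distinct x values are required")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def quadratic_distance(p, q):
    """Assumed distance: Euclidean distance between parameter triples."""
    return math.dist(p, q)


# Hypothetical clusters sampled from two different parabolas.
cluster1 = [(x, 1 + 2 * x + 3 * x * x) for x in range(-2, 3)]
cluster2 = [(x, 1 + 2 * x + 1 * x * x) for x in range(-2, 3)]
p1, p2 = fit_quadratic(cluster1), fit_quadratic(cluster2)
print(p1, p2, quadratic_distance(p1, p2))
```

Two clusters drawn from the same parabola give nearly identical triples and therefore a distance near zero, which makes them candidates for unification.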
Number | Date | Country | Kind |
---|---|---|---|
2005-176700 | Jun 2005 | JP | national |