This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2015-113462, filed on Jun. 3, 2015, and the prior Japanese Patent Application No. 2016-064660, filed on Mar. 28, 2016, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an apparatus and a method for clustering data by streaming clustering while suppressing a reduction in precision.
Clustering processing is an important method that is the basis for artificial intelligence information processing such as image processing, speech recognition, natural language processing, sensor data processing, and DNA sequence mining. The clustering processing is broadly classified into a hierarchical clustering method which is typified by a self-organizing map or Ward's method, and a non-hierarchical clustering method which is typified by a k-means method.
With respect to the hierarchical clustering method, the self-organizing map has a disadvantage in that it is difficult to handle because the convergence of its calculation is not guaranteed, and Ward's method has a disadvantage in that it requires calculation of the distances between all pairs of data points, so that calculation is difficult particularly for large-scale data.
Meanwhile, with respect to the non-hierarchical clustering method, the k-means method has a disadvantage in that the number of clusters has to be given in advance, which makes it difficult to apply to an unknown environment. In recent years, a DP-means method has been proposed that is based on a non-parametric Bayesian method using a probability model and that automatically determines the number of clusters according to the complexity of the data. This method is a non-hierarchical clustering method and has an advantage in that the number of clusters is determined dynamically and, unlike the hierarchical clustering methods, it is easy to handle.
Here, a summary of the DP-means method is described with reference to
The DP-means method updates the center of gravity of each cluster and the number of clusters such that the objective function value φ(x, λ) which is illustrated below in Formula (1) is optimized.
φ(x, λ) = ΣD(x, u) + λ²k (1)
Here, x indicates a d-dimensional data point group, k indicates the number of clusters, and λ indicates a hyperparameter which determines the number of clusters. D(x, u) is a distance function, and corresponds, for example, to the squared Euclidean distance which is illustrated below in Formula (2).
D(x, u) = ∥u − x∥² (2)
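For illustration, Formula (1) with the squared Euclidean distance of Formula (2) may be computed as in the following non-limiting Python sketch; the function and variable names are illustrative and do not appear in the application.

```python
import numpy as np

def dp_means_objective(points, centers, labels, lam):
    # Formula (1): sum of D(x, u) from each data point to its assigned
    # cluster center, plus lambda^2 * k for k clusters, where D is the
    # squared Euclidean distance of Formula (2).
    dist_sum = sum(float(np.sum((x - centers[c]) ** 2))
                   for x, c in zip(points, labels))
    return dist_sum + lam ** 2 * len(centers)

# Two points assigned to one cluster centered at the origin:
pts = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
ctrs = [np.array([0.0, 0.0])]
print(dp_means_objective(pts, ctrs, [0, 0], 10.0))  # 1 + 1 + 100 = 102.0
```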
Here, in the DP-means method, when the objective function value φ(x, λ) is optimized, all of the data points are used and therefore all of them have to be held. In addition, in a case where clustering processing is performed on a large number of data points, the amount of calculation increases in proportion to the number of data points which are held, and therefore becomes very large. A method in which clustering processing is performed in a state in which all data points are held is referred to as "batch clustering".
In contrast to this, a method in which clustering processing is performed by using representative points extracted from all the data points is referred to as “streaming clustering”.
International Publication Pamphlet No. WO2011/142225, Japanese Laid-open Patent Publication No. 2002-304626, Japanese Laid-open Patent Publication No. 2010-134632, Japanese Laid-open Patent Publication No. 2013-182341 and Japanese Laid-open Patent Publication No. 10-171823 are examples of the related art.
B. Kulis and M. Jordan, "Revisiting k-means: New Algorithms via Bayesian Nonparametrics", in ICML 2012, is an example of the related art.
According to an aspect of the invention, an apparatus divides a feature value space in which input data points are to be disposed, into a plurality of local regions, and determines a representative point independently for each of one or more local regions each including at least one data point. In a case where a data point is added to a local region in which the representative point is disposed, the apparatus determines a new representative point to which a weight is assigned, based on the added data point and the representative point, and controls a number of clusters by using the new representative point.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
In order to perform streaming clustering, it is required to select representative points by some method. If the representative points are simply selected at random (referred to as random sampling), there is a problem in that clustering precision is reduced. This is because, in random sampling, in a case where imbalance exists among the input data points and there is a cluster with few data points, the clustering information of that cluster is easily lost.
In streaming clustering, an example in which clustering precision is reduced is described with reference to
It is desirable to suppress precision reduction in streaming clustering.
As embodiments of the present application, applied examples of a data clustering method, an information processing device, and a data clustering program will be described in detail with reference to the drawings. Here, the present disclosure is not limited to these applied examples.
The information processing device 1 includes a control unit 10 and a storage unit 20.
The control unit 10 corresponds to an electronic circuit such as a central processing unit (CPU). Then, the control unit 10 has an internal memory in order to store a program and control data which specify various processing, and thereby executes various processing. The control unit 10 includes a data acquisition unit 11, a representative point update unit 12, a representative point compression unit 13, and a clustering unit 14.
For example, the storage unit 20 is a storage device, such as a semiconductor memory element such as a RAM or a flash memory, a hard disk, or an optical disc. The storage unit 20 includes a grid list 21 and a representative point list 22. The grid list 21 indicates grid information which holds the representative points. The representative point list 22 indicates information on representative points which are held in the grid. For example, coordinates and weights of the representative points are included in the information on the representative points. The data structures of the grid list 21 and the representative point list 22 will be described later.
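For illustration, the grid list 21 and the representative point list 22 may be modeled as in the following non-limiting Python sketch; the class and field names are assumptions introduced for explanation and do not appear in the application.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RepresentativePoint:
    coordinates: List[float]     # coordinates of the representative point
    weight: float = 1.0          # weight assigned to the representative point

@dataclass
class Grid:
    grid_coordinates: List[int]  # position of the grid in the feature value space
    representative_points: List[RepresentativePoint] = field(default_factory=list)

grid_list: List[Grid] = []       # corresponds to the grid list 21
```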
Here, the grid is described referring to
The weight of a representative point is a parameter for keeping the center of gravity position unchanged before and after representative point compression processing, and is assigned to the representative point information. In an example in
Here, the data structure of the grid list 21 is described with reference to
The data structure of the representative point list 22 is described with reference to
Returning to
The representative point update unit 12 updates representative points such that the number of representative points in the grid does not exceed the fixed number. The fixed number here indicates the number of data points which are able to be held in the grid, hereinafter referred to as “maximum number of held points”. For example, the representative point update unit 12 receives data points which are acquired by the data acquisition unit 11. There may be one or a plurality of received data points. The representative point update unit 12 determines whether there is a grid which includes the coordinates of received data points, among grids registered in the grid list 21. In a case where the representative point update unit 12 determines that there is a grid which includes coordinates of the data points, among grids registered in the grid list 21, the data points are added to the grid. At this time, the representative point update unit 12 sets the data point weight at one. In a case where the representative point update unit 12 determines that there is no grid which includes coordinates of the data points, among grids registered in the grid list 21, a grid is newly generated which includes the coordinates of the data points, and the data points are added to the generated grid. At this time, the representative point update unit 12 sets the data point weight at one, and the newly generated grid is added to the grid list 21. The representative point update unit 12 determines whether or not the number of representative points exceeds the maximum number of held points with respect to the respective grids which are registered in the grid list 21 after the received data points are added to the grid. 
In the representative point update unit 12, in a case where the number of representative points exceeds the maximum number of held points, representative point compression processing is executed and the representative points are updated such that the number of representative points in the target grid does not exceed the maximum number of held points. Representative point compression processing is executed by the representative point compression unit 13.
In the representative point compression unit 13, in a case where the number of representative points in the grid exceeds the maximum number of held points, the representative points are compressed such that the number of representative points in the grid does not exceed the maximum number of held points. For example, the representative point compression unit 13 selects a new representative point from among the representative points which are included in the grid. The representative point compression unit 13 selects the representative point that is closest to the new representative point, from among the representative points which are included in the grid. The representative point compression unit 13 adds one to the weight of the new representative point, and deletes the closest representative point. That is, the representative point compression unit 13 causes the number of representative points in the grid to finally become the maximum number of held points or less, by compressing the new representative point and the representative point closest to the new representative point. Here, as a selection method of the new representative point, a method of randomly sampling the representative points included in the grid, or a method of selecting representative points for a fixed number of clusters, such as the k-means method, may be used. In addition, the maximum number of held points may be a fixed number that differs for each grid, or may be a fixed number that is the same for each grid. An example of a determination method of the maximum number of held points will be described later.
Here, representative point update processing according to the representative point update unit 12 is described with reference to
That is, when a grid including the coordinates of a new data point is registered in the grid list 21, the representative point update unit 12 adds the new data point to the grid. When a grid including the coordinates of a new data point is not registered in the grid list 21, the representative point update unit 12 generates a new grid, and adds the new data point to the newly generated grid. Here, a new data point group of a reference numeral p100 is merged in a grid g1, a new data point group of a reference numeral p200 is merged in a grid g2, and a new data point group of a reference numeral p300 is added to a newly generated grid g4.
Then, the representative point update unit 12 determines whether or not the number of representative points exceeds the maximum number of held points with respect to the respective grids after the new data points are merged in the grid, and selects the grid in which the maximum number of held points is exceeded. Here, the maximum number of held points is set at a fixed number of ten points which is the same in each grid. In grids g1, g3, and g4, it is determined that the maximum number of held points 10 is not exceeded. In grid g2, there are 20 points, and it is determined that the maximum number of held points 10 is exceeded. As a result, the grid g2 is selected as a grid in which the maximum number of held points is exceeded.
Then, the representative point compression unit 13 selects a representative point in a grid in which the maximum number of held points is exceeded, and compresses the selected representative point and a representative point closest to the selected representative point. The representative point compression unit 13 adds one to the weight of the selected representative point, and deletes the closest representative point. The representative point compression unit 13 repeats compression processing until the number of representative points within the grid becomes the maximum number of held points or less. In this case, the representative point compression unit 13 selects the representative point p4 within the grid g2, and compresses the selected representative point p4 and the representative point p5 closest to the representative point p4. The representative point compression unit 13 adds one to the weight of the representative point p4, and deletes the representative point p5. As a result, the weight of the representative point p4 becomes two.
Returning to
Here, clustering processing according to the clustering unit 14 is described with reference to
Here, an example of the grid is described with reference to
Another example of a grid is described with reference to
Next, an example of a determination method of a maximum number of held points is described with reference to
In other words, as indicated in
Here, furthermore, as indicated above, since it is sufficient for at most d representative points to be held within a grid, it is possible to maintain the precision of the cluster centers of gravity by performing clustering processing under the assumption that there are d fixed clusters within a grid. For example, since a method is known in which the error of the center of gravity position of the cluster is suppressed to a constant factor in a case where the number of clusters is fixed, such as the k-means method, the k-means method may be used as a compression algorithm, and when compression processing is performed with the number of clusters set at d, it is possible to perform clustering processing with good precision. At this time, in the k-means method, the maximum number of held points of the held representative points may be set at 3×d×log(d).
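Under the assumption that log denotes the natural logarithm (the application does not specify the base) and that the value is rounded up to a whole count, the maximum number of held points 3×d×log(d) may be computed as in this illustrative Python sketch:

```python
import math

d = 8                                      # example dimension of the feature value space
max_held = math.ceil(3 * d * math.log(d))  # 3*d*log(d), rounded up to an integer
```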
Representative Point Update Processing Flow Chart
Next, an operational flowchart for representative point update processing according to Applied Example 1 is described with reference to
First, the representative point update unit 12 extracts a data point x from the data point group X (step S11). The representative point update unit 12 determines whether or not there is a grid which includes the data point x in a range thereof, within the grid list 21 (step S12). In a case where it is determined that there is no grid which includes the data point x in the range thereof (step S12: No), the representative point update unit 12 generates a new grid which includes the data point x in the range thereof, and adds grid coordinates 21b of the generated grid to the grid list 21 (step S13). Then, the representative point update unit 12 transitions to step S14.
Meanwhile, in a case where it is determined that there is a grid which includes the data point x in the range thereof (step S12: Yes), the representative point update unit 12 adds the data point x, whose weight is set at 1.0, to the grid (step S14). The representative point update unit 12 determines whether or not all the data points x are extracted from the data point group X (step S15). In a case where it is determined that not all the data points x are extracted (step S15: No), the representative point update unit 12 transitions to step S11 so as to extract the subsequent data point.
Meanwhile, in a case where it is determined that all the data points x are extracted (step S15: Yes), the representative point update unit 12 performs, if necessary, optimization processing on the grids (step S16). When the representative point update unit 12 generates a grid, there is a case where the center of the grid is not optimal. In this case, too many grids may be held, and there is a possibility that an unnecessary storage region is required.
Subsequently, the representative point update unit 12 extracts a grid from the grid list 21 (step S17). The representative point update unit 12 determines whether or not the number of representative points included in the grid is the maximum number of held points or less (step S18). In a case where it is determined that the number of representative points included in the grid is the maximum number of held points or less (step S18: Yes), the representative point update unit 12 transitions to step S20 without executing representative point compression processing.
In a case where the number of representative points included in the grid is not the maximum number of held points or less, that is, is greater than the maximum number of held points (step S18: No), the representative point update unit 12 executes representative point compression processing (step S19). The operational flowchart for representative point compression processing will be described later. Then, the representative point update unit 12 transitions to step S20.
In step S20, the representative point update unit 12 determines whether or not all the grids are extracted from the grid list 21 (step S20). In a case where it is determined that not all the grids are extracted (step S20: No), the representative point update unit 12 transitions to step S17 so as to extract the subsequent grids.
Meanwhile, in a case where it is determined that all the grids are extracted (step S20: Yes), the representative point update unit 12 ends representative point update processing, and outputs the grid list 21 as an output parameter.
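The representative point update flow above (adding each data point with weight 1.0 to its grid, generating a grid when none exists, and then compressing any grid that exceeds the maximum number of held points) may be sketched as follows. This is a non-limiting Python illustration: the grid is identified by a simple origin-anchored hypercube index, the grid optimization of step S16 is omitted, and the compression routine is passed in as a parameter; the function names are assumptions.

```python
import math

def grid_index(x, side):
    # Identify the hypercube grid (of the given side length) containing x.
    return tuple(math.floor(v / side) for v in x)

def update_representative_points(grid_list, data_points, side, max_held, compress):
    # Steps S11-S15: add each data point, with weight 1.0, to its grid,
    # generating a new grid when none includes the point in its range.
    for x in data_points:
        key = grid_index(x, side)
        grid_list.setdefault(key, []).append((tuple(x), 1.0))  # (coords, weight)
    # Steps S17-S20: compress every grid exceeding the maximum number of
    # held points; `compress` stands for the compression processing.
    for key, reps in grid_list.items():
        if len(reps) > max_held:
            grid_list[key] = compress(reps, max_held)
    return grid_list
```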
Representative Point Compression Processing Flow Chart
Next, an operational flowchart for representative point compression processing according to Applied Example 1 is described with reference to
The representative point compression unit 13 selects new representative points, as many as the maximum number of held points, from among the representative points which are acquired as the input parameter (step S21). For example, random sampling, the k-means method, and the like are given as the new representative point selection method. Then, the representative point compression unit 13 sets the weights of the new representative points at zero (step S22).
Subsequently, the representative point compression unit 13 sequentially selects a former representative point x which is held in a grid (step S23). The representative point compression unit 13 selects a new representative point that is closest to the selected former representative point x (step S24). The representative point compression unit 13 adds one to the weight of the selected new representative point (step S25). Then, the representative point compression unit 13 deletes the selected former representative point x from the representative point list 22 (step S26).
Subsequently, the representative point compression unit 13 determines whether or not all the former representative points are selected (step S27). In a case where it is determined that not all the former representative points are selected (step S27: No), the representative point compression unit 13 transitions to step S23 so as to select a subsequent former representative point. Meanwhile, in a case where it is determined that all the former representative points are selected (step S27: Yes), the representative point compression unit 13 ends the representative point compression processing, and outputs the coordinates and weight of the new representative points as output parameters.
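The compression flow of steps S21 to S27 may be sketched as follows. This is an illustrative, non-limiting Python sketch: random sampling is used as the selection method, and each former representative point contributes its own weight to the closest new representative point (which equals adding one per point when all weights are 1.0, as in the flow chart); the function name is an assumption.

```python
import random

def compress_representative_points(reps, max_held):
    # Step S21: select new representative points by random sampling.
    new_reps = [list(c) for c, _ in random.sample(reps, max_held)]
    weights = [0.0] * max_held                        # step S22: weights at zero
    for coords, w in reps:                            # steps S23-S26
        # select the new representative point closest to the former point
        nearest = min(range(max_held),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(new_reps[i], coords)))
        weights[nearest] += w   # absorb the former point; it is then deleted
    return [(tuple(c), w) for c, w in zip(new_reps, weights)]
```

Since every former point's weight is transferred to a new representative point, the total weight within the grid is preserved, which keeps the center of gravity position consistent with the description above.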
Clustering Processing Flow Chart
First, the information processing device 1 sets the grid list 21 at an empty set (step S31). Then, the information processing device 1 updates the grid list 21 by using representative point update processing (step S32). Here, the flow chart of representative point update processing is indicated in
Then, the information processing device 1 executes clustering processing by inputting representative points (coordinates, weights) within the grid list 21 (step S33). For example, the weighted DP-means method is applied as clustering processing. Here, an operational flowchart of the clustering processing will be described later.
Then, the information processing device 1 ends clustering processing which uses representative point update processing, and outputs the clustering results as output parameters.
The clustering unit 14 sets the center of gravity of all the representative points as the cluster center set U (step S41). The clustering unit 14 determines whether or not the clustering processing has converged (step S42). In a case where the clustering processing has not converged yet (step S42: No), the clustering unit 14 extracts a representative point x from the representative point group X (step S43). Then, the clustering unit 14 extracts, from the cluster center set U, a center u which is closest to the extracted representative point x (step S44).
The clustering unit 14 determines whether or not the distance between the representative point x and the center u is λ² or more (step S45). In a case where it is determined that the distance between the representative point x and the center u is λ² or more (step S45: Yes), the clustering unit 14 adds the representative point x to the cluster center set U (step S46). Then, the clustering unit 14 transitions to step S47.
Meanwhile, in a case where it is determined that the distance between the representative point x and the center u is not λ² or more, that is, smaller than λ² (step S45: No), the clustering unit 14 transitions to step S47. In step S47, the clustering unit 14 updates the label of the representative point x to the nearest cluster label (step S47).
Subsequently, the clustering unit 14 determines whether or not all the representative points x are extracted from the representative point group (step S48). In a case where it is determined that not all the representative points x are extracted (step S48: No), the clustering unit 14 transitions to step S43 so as to extract subsequent representative points.
Meanwhile, in a case where it is determined that all the representative points x are extracted (step S48: Yes), the clustering unit 14 extracts a cluster center u from the cluster center set U (step S49). The clustering unit 14 updates the coordinate values of the extracted center u to the weighted center of gravity of the representative points to which the label corresponding to u is assigned (step S50).
Then, the clustering unit 14 determines whether or not all the cluster centers u are extracted from the cluster center set U (step S51). In a case where it is determined that not all the cluster centers u are extracted (step S51: No), the clustering unit 14 transitions to step S49 so as to extract the subsequent cluster center. Meanwhile, in a case where it is determined that all the cluster centers u are extracted (step S51: Yes), the clustering unit 14 transitions to step S42 so as to carry out determination for convergence of the clustering processing.
In step S42, in a case where it is determined that the clustering processing has converged (step S42: Yes), the clustering unit 14 outputs the cluster center set U (step S52), and the clustering processing ends.
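The clustering flow of steps S41 to S52 may be sketched as the following weighted variant of the DP-means method. This is a non-limiting Python illustration: a fixed iteration count stands in for the convergence determination of step S42, the threshold is applied to the squared Euclidean distance, and the function name is an assumption.

```python
import numpy as np

def weighted_dp_means(reps, weights, lam, n_iter=20):
    reps = [np.asarray(x, float) for x in reps]
    # Step S41: the weighted center of gravity of all representative points.
    centers = [sum(w * x for x, w in zip(reps, weights)) / sum(weights)]
    labels = [0] * len(reps)
    for _ in range(n_iter):  # fixed iterations in place of the test of S42
        for i, x in enumerate(reps):
            d2 = [np.sum((x - u) ** 2) for u in centers]
            if min(d2) >= lam ** 2:          # steps S45-S46: open a new cluster
                centers.append(x.copy())
            labels[i] = int(np.argmin(       # step S47: nearest cluster label
                [np.sum((x - u) ** 2) for u in centers]))
        for j in range(len(centers)):        # steps S49-S50: weighted centers
            members = [i for i, l in enumerate(labels) if l == j]
            if members:
                centers[j] = (sum(weights[i] * reps[i] for i in members)
                              / sum(weights[i] for i in members))
    return centers, labels
```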
Here, when the information processing device 1 performs clustering processing independently in each fixed period by using the representative points which are updated due to the representative point update processing according to Applied Example 1, it is possible to perform online clustering processing. Online clustering in which the representative point update processing is used according to Applied Example 1 is described with reference to
As indicated in
When time elapses, and the time point t is one, the information processing device 1 receives input data points. The information processing device 1 performs representative point update processing on the received input data points and the representative points which were updated when the time point t was zero. That is, when there already exists a grid which includes the input data points in the range, the information processing device 1 adds the input data points to the grid as representative points. When there exists no grid which includes the input data points in the range, the information processing device 1 generates a new grid, and adds the input data points to the newly generated grid as representative points. The information processing device 1 performs representative point compression processing and updates the representative points such that the number of representative points in each grid does not exceed the maximum number of held points. Then, the information processing device 1 performs batch clustering processing by inputting the updated representative points.
When time elapses, and the time point t is two, the information processing device 1 receives input data points. The information processing device 1 performs representative point update processing on the received input data points and the representative points which were updated when the time point t was one. That is, when there already exists a grid which includes the input data points in the range, the information processing device 1 adds the input data points to the grid as representative points. When there exists no grid which includes the input data points in the range, the information processing device 1 generates a new grid, and adds the input data points to the newly generated grid as representative points. The information processing device 1 performs representative point compression processing and updates the representative points such that the number of representative points in each grid does not exceed the maximum number of held points. Then, the information processing device 1 performs batch clustering processing by inputting the updated representative points.
In this manner, the information processing device 1 is able to perform online clustering processing by performing clustering processing, independently at each time point, on data points which are input by time division.
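The online flow above, in which representative point update processing and batch clustering are repeated at each time point, may be sketched as follows. This is a non-limiting Python illustration; `update` and `cluster` stand for the representative point update processing and the batch clustering processing described in the body text, and the function names are assumptions.

```python
def online_clustering(batches, side, max_held, lam, update, cluster):
    # At each time point, update the representative points with the newly
    # received data points and the points carried over from the previous
    # time point, then run batch clustering on the updated representatives.
    grid_list = {}
    results = []
    for data_points in batches:        # one batch per time point t = 0, 1, 2, ...
        grid_list = update(grid_list, data_points, side, max_held)
        reps = [rw for reps in grid_list.values() for rw in reps]
        results.append(cluster(reps, lam))
    return results
```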
According to Applied Example 1, the information processing device 1 performs streaming clustering as follows. That is, the information processing device 1 divides a feature value space in which input data points are disposed into grids. The information processing device 1 independently determines a representative point for each grid which includes one or more data points. Thereafter, when data points are added to a grid in which a representative point is present, the information processing device 1 determines a new representative point by performing weighting based on the representative point and the added data points. Then, the information processing device 1 controls the number of clusters by using the new representative point. This allows the information processing device 1 to suppress reduction of clustering precision in streaming clustering. That is, in comparison to a case where the number of data points is randomly reduced, the information processing device 1 determines a weighted representative point as a representation of the input data points, and is able to suppress reduction of clustering precision by performing clustering on the weighted representative points.
In addition, according to Applied Example 1, the information processing device 1 divides a feature value space into grids, each of which is a d-dimensional hypercube whose side length is λ/√d, where λ is the threshold which determines the cluster particle size and d is the number of dimensions of the feature value space. In this case, since the length of the diagonal line of the d-dimensional hypercube is λ (the threshold which determines the cluster particle size), when at least one data point is included in a grid, the cluster does not disappear. As a result, it is possible to avoid a case where a cluster disappears and the error in the number of clusters increases, which would be caused by the information processing device 1 randomly determining the representative points.
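The property that the diagonal of such a hypercube equals λ follows directly from the Pythagorean theorem, and may be checked numerically as in this short illustrative sketch:

```python
import math

# For a d-dimensional hypercube with side length lam / sqrt(d), the diagonal
# (the largest distance between two points in the grid) equals lam, so two
# points in the same grid are never farther apart than the cluster particle
# size threshold.
d, lam = 5, 10.0
side = lam / math.sqrt(d)
diagonal = math.sqrt(d * side ** 2)
assert abs(diagonal - lam) < 1e-9
```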
In addition, according to Applied Example 1, the information processing device 1 divides a feature value space into grids, each of which is a d-dimensional hypersphere whose diameter is the threshold λ which determines the cluster particle size, where d is the number of dimensions of the feature value space. In this case, since the length of the diameter of the d-dimensional hypersphere is λ (the threshold which determines the cluster particle size), when at least one data point is included in a grid, it is assured that the cluster does not disappear. As a result, it is possible to avoid a case where a cluster disappears and the error in the number of clusters increases, which would be caused by the information processing device 1 randomly determining the representative points.
In addition, according to Applied Example 1, the information processing device 1 sets the maximum number of held points, which is the number of representative points held in a grid, at the dimension number d. In this case, even if representative point compression processing is performed, d representative points, as the maximum number of held points, remain in the grid, thereby assuring that no cluster disappears. As a result, it is possible to avoid a case where a cluster disappears and the error in the number of clusters increases, which would be caused by the information processing device 1 randomly determining the representative points.
In addition, according to Applied Example 1, for data points which have been time divided, the information processing device 1 determines a new representative point, based on the representative points that have been already determined and the data points which have been time divided. This allows the information processing device 1 to perform online clustering processing in which representative point update processing is used, and to suppress reduction of online clustering precision.
In addition, according to Applied Example 1, the information processing device 1 sets the maximum number of held points, which are held in a grid, at 3×d×log(d). This allows the information processing device 1 to perform clustering processing with good precision when the k-means method is used as the compression algorithm, by setting the maximum number of held points of the representative points at 3×d×log(d).
As for the information processing device 1 in Applied Example 1, description has been given of a case where a new cluster is generated when a representative point is separated from the cluster center by λ or more. However, the information processing device 1 is not limited thereto, and in a case of falling into a local solution, a cluster may be partitioned by splitting a representative point. The local solution refers to a state in which it is difficult to reach an optimal solution.
Case of Splitting Representative Point
Here, the case of splitting a representative point is described with reference to
Here, x is a data point belonging to the d-dimensional data point group X which is input, u is a cluster center belonging to the cluster center set C, k is the number of clusters, and λ is a parameter which determines the cluster particle size. In
Here, as a distance function defined between the cluster center u and the data point x, a squared Euclidean distance is used, but the distance function is not limited thereto. For example, the distance function may be one that satisfies symmetry, such as a Manhattan distance or an L∞ distance, and may be one that does not satisfy symmetry, such as KL divergence, a Mahalanobis distance, or an Itakura-Saito distance.
As indicated in
In contrast, in the case of two clusters, the cluster centers u are (−1, 0) and (1, 0). Since the input data points (−1, 0) and (1, 0) are each allocated to the nearest cluster, the distance between the input data point (−1, 0) and its cluster center u is zero, and the distance between the input data point (1, 0) and its cluster center u is zero. Therefore, the value of the objective function L(X, C) (= 20000×0² + 10²×2) becomes 200. In the case of two clusters, the objective function value is smaller than that in the case of one cluster. In other words, in the case of one cluster, the objective function does not reach the minimum value, and falls into a local solution.
Consider the case when the weight of the input data point (−1, 0) is 10, and the weight of the input data point (1, 0) is 10. In the case of one cluster, the value of the objective function L(X, C) (= 20×1² + 10²×1) is 120. In contrast, in the case of two clusters, the value of the objective function L(X, C) (= 20×0² + 10²×2) is 200. In the case of one cluster, the objective function value is smaller than that in the case of two clusters, and an optimal solution is obtained in the case of one cluster. That is, there is a case of falling into a local solution when the weight of a representative point included in a cluster increases. In that case, it is preferable to split the cluster.
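The arithmetic above can be reproduced with a short sketch (an illustrative Python fragment, not part of the embodiment; the objective is the weighted form L(X, C) = Σ w×d² + λ²×k with λ = 10):

```python
# Weighted DP-means objective: L(X, C) = sum(w * d^2) + lambda^2 * k,
# where d is the distance from each weighted point to its nearest cluster
# center and k is the number of clusters.

def objective(points, centers, lam):
    # points: list of ((x, y), weight); centers: list of (x, y)
    total = 0.0
    for x, w in points:
        d2 = min((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2 for c in centers)
        total += w * d2
    return total + lam ** 2 * len(centers)

lam = 10.0

# Weight 10 per point: one cluster is optimal.
light = [((-1.0, 0.0), 10), ((1.0, 0.0), 10)]
one_cluster = objective(light, [(0.0, 0.0)], lam)                  # 120.0
two_clusters = objective(light, [(-1.0, 0.0), (1.0, 0.0)], lam)    # 200.0

# Weight 10000 per point: one cluster becomes a local solution.
heavy = [((-1.0, 0.0), 10000), ((1.0, 0.0), 10000)]
heavy_one = objective(heavy, [(0.0, 0.0)], lam)                    # 20100.0
heavy_two = objective(heavy, [(-1.0, 0.0), (1.0, 0.0)], lam)       # 200.0
```

With the small weights, one cluster wins (120 < 200); with the large weights, splitting into two clusters wins (200 < 20100).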
In Applied Example 2, a case is described where, in the clustering using weighted representative points, the information processing device 1 splits a representative point when the weight of the representative point increases.
Information Processing Device Configuration According to Applied Example 2
The data structure of the representative point list 22A is described with reference to
Here, the representative point range 22d is described with reference to
Returning to
w ≧ 16(λ/σn)²   (4)
Formula (4) is used as the condition under which the representative point is split, on the grounds that, when Formula (4) is satisfied, an expectation value of the objective function after splitting becomes smaller than that before splitting. Here, the condition for splitting a representative point is described with reference to
As indicated in
V(x) = (b − a)²/12
A right-side diagram of
w(σ1²/12 + . . . + σn²/12 + . . . ) + λ² ≧ w(σ1²/12 + . . . + σn²/48 + . . . ) + 2λ²   (5)
Here, in Formula (5), the expectation values of error terms of respective dimensions are added.
Here, Formula (5) above is represented by Formula (32) below.
w × (an expectation value of distance between a data point belonging to the representative point before splitting and the representative point) ≧ w1 × (an expectation value of distance between the data points belonging to representative point 1 after splitting and representative point 1) + w2 × (an expectation value of distance between the data points belonging to representative point 2 after splitting and representative point 2) + λ² . . . (32)
In Formula (32), w1 is a weight of representative point 1 of one cluster in the case of splitting, and w2 is a weight of representative point 2 of the other cluster in the case of splitting.
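The step from Formula (5) back to Formula (4) is short: cancelling the common terms on both sides of Formula (5) leaves w×σn²×(1/12 − 1/48) ≧ λ², and since 1/12 − 1/48 = 1/16, this is exactly w ≧ 16(λ/σn)². A minimal check (illustrative Python fragment; the values chosen for λ and σn are arbitrary):

```python
from fractions import Fraction

# Formula (5) minus its common terms: splitting pays off when
#   w * sigma_n^2 * (1/12 - 1/48) >= lambda^2
coeff = Fraction(1, 12) - Fraction(1, 48)   # equals 1/16

def should_split(w, sigma_n, lam):
    # Formula (4): split when w >= 16 * (lam / sigma_n)^2
    return w >= 16 * (lam / sigma_n) ** 2

# e.g. lam = 10 and sigma_n = 4 give a weight threshold of exactly 100
print(coeff, should_split(99, 4.0, 10.0), should_split(100, 4.0, 10.0))
```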
Returning to
It is assumed that a representative point is split into two in the nth dimension. At this time, the values of the two representative points after splitting, (u1, w1, p1, q1) and (ur, wr, pr, qr), are updated in accordance with Formula (6) to Formula (13) below with respect only to values of the nth dimension; with respect to values other than the nth dimension, the values of the original representative point are taken over.
u1(n) = (u(n) + q(n))/2   (6)
w1 = wold × (u(n) − q(n))/(p(n) − q(n))   (7)
p1(n) = u(n)   (8)
q1(n) = q(n)   (9)
ur(n) = (u(n) + p(n))/2   (10)
wr = wold × (p(n) − u(n))/(p(n) − q(n))   (11)
pr(n) = p(n)   (12)
qr(n) = u(n)   (13)
In addition, in a case where the representative point splitting unit 31 adds a data point to a representative point, a sequential update of the representative point is performed. For example, it is assumed that (center, weight, maximum value, minimum value) of a representative point are (uold, wold, pold, qold), respectively, and an input data point is x. In this case, (center, weight, maximum value, minimum value) of the representative point after update become (unew, wnew, pnew, qnew), respectively. unew, wnew, pnew, and qnew are respectively represented by Formula (14) to Formula (17) below.
unew = (wold × uold + x)/(wold + 1)   (14)
wnew = wold + 1   (15)
pnew = max(pold, x)   (16)
qnew = min(qold, x)   (17)
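Formula (14) to Formula (17) amount to the following sketch (illustrative; maximum and minimum are applied per dimension, and the function name is invented for the sketch):

```python
def add_point(u_old, w_old, p_old, q_old, x):
    # Sequential update of a representative point when data point x is added.
    w_new = w_old + 1                                   # Formula (15)
    u_new = [(w_old * u + xi) / w_new
             for u, xi in zip(u_old, x)]                # Formula (14)
    p_new = [max(p, xi) for p, xi in zip(p_old, x)]     # Formula (16)
    q_new = [min(q, xi) for q, xi in zip(q_old, x)]     # Formula (17)
    return u_new, w_new, p_new, q_new

# A representative point of weight 3 absorbs the new data point (5.0, 0.0).
u, w, p, q = add_point([1.0, 0.0], 3, [2.0, 1.0], [0.0, -1.0], [5.0, 0.0])
print(u, w, p, q)  # [2.0, 0.0] 4 [5.0, 1.0] [0.0, -1.0]
```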
The clustering unit 14A executes batch clustering processing by inputting weighted representative points which are included in grids registered in the grid list 21. However, in the case of the weighted representative points, the objective function is changed from Formula (3) to Formula (18) below.
Here, x indicates a data point included in the input d-dimension data point group, u is the cluster center, k is the number of clusters, λ is a parameter which determines a cluster particle size, and w is a weight.
In this manner, since the objective function is changed, the condition under which a new cluster is generated is changed such that a new cluster is generated when the Euclidean distance between data points is λ/√w or more. That is, in a case where the distance between a representative point and the center that is closest to the representative point in the cluster center set is λ/√w or more, the clustering unit 14A adds the representative point to the cluster center set. That is, the clustering unit 14A generates a new cluster including the representative point as the center thereof. Here, the above condition is one for the case where the Euclidean distance is used as the distance function to determine the cluster addition condition, but there exists an equivalent condition in a case where another function is used. For example, in a case where the squared Euclidean distance is used as the distance function, the condition under which a new cluster is generated is that data points are separated by λ²/w or more; since the squared Euclidean distance is obtained by taking the square of the Euclidean distance, this condition is equivalent to the condition in which the Euclidean distance is used.
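The equivalence noted above is immediate: both sides of d ≧ λ/√w are non-negative, so squaring gives d² ≧ λ²/w and vice versa. As a small check (illustrative values only):

```python
import math

def new_cluster_euclidean(d, lam, w):
    # New-cluster condition using the Euclidean distance d.
    return d >= lam / math.sqrt(w)

def new_cluster_squared(d2, lam, w):
    # Equivalent condition using the squared Euclidean distance d2 = d^2.
    return d2 >= lam ** 2 / w

# The two conditions agree for any non-negative distance.
for d in (0.0, 0.5, 1.0, 2.0, 5.0):
    assert new_cluster_euclidean(d, 3.0, 4.0) == new_cluster_squared(d * d, 3.0, 4.0)
```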
Representative Point Splitting Processing Flow Chart
An operational flowchart of representative point update processing according to Applied Example 2 is described with reference to
First, the representative point splitting unit 31 sets a corresponding representative point candidate x_a at null as an initial value, and sets a nearest distance d_a at λ² (step S61). Here, the corresponding representative point candidate means a candidate for a representative point to be split.
The representative point splitting unit 31 determines whether or not all the representative points are extracted from the representative point group X (step S62). In a case where it is determined that not all the representative points are extracted (step S62: No), the representative point splitting unit 31 extracts coordinates x_c, a weight w_c, and a range r_c of one representative point from the representative point group X (step S63).
The representative point splitting unit 31 calculates a distance between the coordinates x_c of the representative point and the input data point x, and the calculated distance is set to d_c (step S64). The representative point splitting unit 31 determines whether or not the distance d_c is smaller than the nearest distance d_a (step S65). In a case where it is determined that the distance d_c is not smaller than the nearest distance d_a (step S65: No), the process transitions to step S62 so as to cause the representative point splitting unit 31 to extract subsequent representative points.
Meanwhile, in a case where it is determined that the distance d_c is smaller than the nearest distance d_a (step S65: Yes), the representative point splitting unit 31 calculates a range r_a after update, based on the data points obtained by adding the input data point x to the current representative point (step S66). For example, the representative point splitting unit 31 calculates the range after update in a case where the current representative point is updated by utilizing Formula (16) and Formula (17).
The representative point splitting unit 31 calculates diameters of the representative point from the range r_a after update (step S67), and sets the maximum of the calculated diameters as σ_a (step S68). Then, the representative point splitting unit 31 determines whether or not w_c+1 ≧ 16(λ/σ_a)² and the input data point x is outside the range r_c of the current representative point (step S69). In a case where it is determined that w_c+1 ≧ 16(λ/σ_a)² and the input data point x is outside the range r_c of the current representative point (step S69: Yes), the process transitions to step S62 to cause the representative point splitting unit 31 to extract the subsequent representative points. That is, even if w_c+1 ≧ 16(λ/σ_a)², when the input data point x is outside the range r_c of the current representative point, the process transitions to step S62 so that the current representative point is not set as the corresponding representative point candidate.
Meanwhile, in a case where it is determined that w_c+1 < 16(λ/σ_a)² or that the input data point x is not outside the range r_c of the current representative point (step S69: No), the representative point splitting unit 31 sets the current representative point x_c as the corresponding representative point candidate x_a (step S70). Then, the process transitions to step S62 to cause the representative point splitting unit 31 to extract the subsequent representative points.
In step S62, in a case where it is determined that all the representative points are extracted (step S62: Yes), the representative point splitting unit 31 determines whether or not the corresponding representative point candidate x_a is null (step S71). In a case where it is determined that the corresponding representative point candidate x_a is null (step S71: Yes), the representative point splitting unit 31 adds the input data point x to the representative point group X as a representative point. At this time, the representative point splitting unit 31 sets the weight w of the added representative point at one, and the range R of the added representative point is set so that the maximum value is x and the minimum value is x (step S72). Then, the representative point splitting unit 31 ends representative point splitting processing.
Meanwhile, in a case where it is determined that the corresponding representative point candidate x_a is not null (step S71: No), the representative point splitting unit 31 updates the coordinates, weight, and range of the corresponding representative point candidate x_a (step S73). That is, the representative point splitting unit 31 adds the input data point x to the corresponding representative point candidate x_a, and updates the coordinates of x_a by utilizing Formula (14), the weight w_a of x_a by utilizing Formula (15), the maximum value of x_a by utilizing Formula (16), and the minimum value of x_a by utilizing Formula (17).
Then, the representative point splitting unit 31 determines whether or not the weight w_a and diameter σ_a after update satisfy w_a ≧ 16(λ/σ_a)² (step S74). In a case where it is determined that the weight w_a and the diameter σ_a after update satisfy w_a ≧ 16(λ/σ_a)² (step S74: Yes), the representative point splitting unit 31 splits the representative point of the corresponding representative point candidate x_a (step S75). For example, the representative point splitting unit 31 splits the representative point of the corresponding representative point candidate x_a by utilizing Formula (6) to Formula (13). Then, the representative point splitting unit 31 ends representative point splitting processing.
In a case where it is determined that the weight w_a and the diameter σ_a after update do not satisfy w_a ≧ 16(λ/σ_a)² (step S74: No), the representative point splitting unit 31 ends representative point splitting processing without splitting the representative point of the corresponding representative point candidate x_a.
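Steps S61 to S75 can be sketched roughly as follows (a simplified illustration, not the device's implementation; the dictionary layout and the returned branch labels are invented for the sketch, and the actual split via Formula (6) to Formula (13) is only signalled rather than performed):

```python
def diameter(p, q):
    # Largest per-dimension extent of the range [q, p] (used as sigma).
    return max(pi - qi for pi, qi in zip(p, q))

def needs_split(w, sigma, lam):
    # Formula (4): splitting pays off when w >= 16 * (lam / sigma)^2.
    return sigma > 0 and w >= 16 * (lam / sigma) ** 2

def insert_point(reps, x, lam):
    """One pass of the splitting flow (steps S61-S75), simplified.

    reps: list of dicts with center u, weight w, per-dim max p, per-dim
    min q. Returns "new", "updated", or "split" to show the branch taken.
    """
    best, d_a = None, lam ** 2                                  # S61
    for i, r in enumerate(reps):                                # S62, S63
        d_c = sum((ui - xi) ** 2 for ui, xi in zip(r["u"], x))  # S64
        if d_c >= d_a:                                          # S65: No
            continue
        # Range after a hypothetical update (S66) and its diameter (S67, S68)
        p_a = [max(pi, xi) for pi, xi in zip(r["p"], x)]
        q_a = [min(qi, xi) for qi, xi in zip(r["q"], x)]
        sigma_a = diameter(p_a, q_a)
        outside = any(xi > pi or xi < qi
                      for xi, pi, qi in zip(x, r["p"], r["q"]))
        if needs_split(r["w"] + 1, sigma_a, lam) and outside:   # S69: Yes
            continue  # adding x would force a split outside the range: skip
        best, d_a = i, d_c                                      # S70
    if best is None:                                            # S71 -> S72
        reps.append({"u": list(x), "w": 1, "p": list(x), "q": list(x)})
        return "new"
    r = reps[best]                                              # S73
    w_old = r["w"]
    r["u"] = [(w_old * u + xi) / (w_old + 1) for u, xi in zip(r["u"], x)]
    r["w"] = w_old + 1
    r["p"] = [max(pi, xi) for pi, xi in zip(r["p"], x)]
    r["q"] = [min(qi, xi) for qi, xi in zip(r["q"], x)]
    if needs_split(r["w"], diameter(r["p"], r["q"]), lam):      # S74
        return "split"  # S75: split via Formula (6) to Formula (13)
    return "updated"

reps = []
print(insert_point(reps, [0.0, 0.0], lam=2.0))   # "new"
print(insert_point(reps, [0.5, 0.0], lam=2.0))   # "updated"
```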
Clustering Processing Flow Chart
The clustering unit 14A sets the center of gravity of all the representative points in the cluster center set U (step S81). The clustering unit 14A determines whether or not the clustering processing has converged (step S82). In a case where it is determined that the clustering processing has not converged (step S82: No), the clustering unit 14A extracts a representative point x from the representative point group (step S83). Then, the clustering unit 14A extracts, from the cluster center set U, a center u having the closest distance from the extracted representative point x (step S84).
The clustering unit 14A determines whether or not the distance between the representative point x and the center u is λ/√w or more (step S85). In a case where it is determined that the distance between the representative point x and the center u is λ/√w or more (step S85: Yes), the clustering unit 14A adds the representative point x to the cluster center set U (step S86). Then, the clustering unit 14A transitions to step S87.
Meanwhile, in a case where it is determined that the distance between the representative point x and the center u is not λ/√w or more, that is, is smaller than λ/√w (step S85: No), the clustering unit 14A transitions to step S87. In step S87, the clustering unit 14A updates the label of the representative point x to the label of the nearest cluster (step S87).
Subsequently, the clustering unit 14A determines whether or not all the representative points x are extracted from the representative point group (step S88). In a case where it is determined that not all the representative points x are extracted (step S88: No), the clustering unit 14A transitions to step S83 so as to extract the subsequent representative points.
Meanwhile, in a case where it is determined that all the representative points x are extracted (step S88: Yes), the clustering unit 14A extracts a cluster center u from the cluster center set U (step S89). The clustering unit 14A updates the coordinate values of the extracted u to the weighted center of gravity of the representative points to which the label corresponding to u is assigned (step S90).
Then, the clustering unit 14A determines whether or not all the cluster centers u are extracted from the cluster center set U (step S91). In a case where it is determined that not all the cluster centers u are extracted (step S91: No), the clustering unit 14A transitions to step S89 so as to extract a subsequent cluster center. Meanwhile, in a case where it is determined that all the cluster centers u are extracted (step S91: Yes), the clustering unit 14A transitions to step S82 so as to carry out convergence determination.
In step S82, in a case where it is determined that the clustering processing has converged (step S82: Yes), the clustering unit 14A outputs the cluster center set U (step S92), and the clustering processing ends.
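The flow of steps S81 to S92 can be sketched as follows (a minimal illustration, not the device's implementation; the iteration cap and the pruning of empty clusters at the end are simplifications added for the sketch):

```python
import math

def weighted_dp_means(reps, lam, max_iter=100):
    """Batch clustering of weighted representative points (steps S81-S92).

    reps is a list of (coordinates, weight) pairs. The cluster center set U
    starts from the overall weighted center of gravity (S81); a representative
    point whose distance to the nearest center is lam / sqrt(w) or more opens
    a new cluster (S85, S86); centers are then moved to the weighted centers
    of gravity of their members (S89, S90) until nothing changes (S82).
    """
    total_w = sum(w for _, w in reps)
    dim = len(reps[0][0])
    centers = [[sum(w * x[i] for x, w in reps) / total_w
                for i in range(dim)]]                           # S81
    labels = [0] * len(reps)
    for _ in range(max_iter):
        old = [list(c) for c in centers]
        for j, (x, w) in enumerate(reps):                       # S83, S88
            dists = [math.dist(x, c) for c in centers]          # S84
            nearest = min(range(len(centers)), key=dists.__getitem__)
            if dists[nearest] >= lam / math.sqrt(w):            # S85 -> S86
                centers.append(list(x))
                nearest = len(centers) - 1
            labels[j] = nearest                                 # S87
        for k in range(len(centers)):                           # S89-S91
            wk = sum(w for (x, w), l in zip(reps, labels) if l == k)
            if wk > 0:
                centers[k] = [sum(w * x[i] for (x, w), l in zip(reps, labels)
                                  if l == k) / wk for i in range(dim)]
        if centers == old:                                      # S82: converged
            break
    used = sorted(set(labels))          # drop empty clusters (sketch only)
    labels = [used.index(l) for l in labels]
    centers = [centers[k] for k in used]
    return centers, labels                                      # S92

reps = [([-1.0, 0.0], 10000), ([1.0, 0.0], 10000)]
centers, labels = weighted_dp_means(reps, lam=10.0)
print(centers, labels)
```

With λ = 10 and weight 10000 per point, λ/√w = 0.1, so the two heavy points at distance 1 from the initial center each open their own cluster, matching the local-solution example above.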
According to Applied Example 2, the information processing device 1 holds the range of the data points which are included in a representative point. In a case where the weight of the representative point exceeds a value that is inversely proportional to the square of the parameter indicating the range of data points included in the representative point, the information processing device 1 splits the representative point so that the resulting representative points may be included in different clusters. In this case, when the weight exceeds the predetermined value, the information processing device 1 is able to perform clustering so that the objective function value becomes small, by splitting the representative point so that the new representative points are respectively included in different clusters. That is, even in a case where the representative points are utilized, the information processing device 1 is able to perform clustering without reducing precision.
In addition, according to Applied Example 2, when the information processing device 1 performs clustering processing by using the weighted representative points, the threshold of the distance from the cluster center which is permitted as the same cluster is set at a value which is inversely proportional to the square root of the weight of the representative point. In this way, when the information processing device 1 performs weighted clustering, it is possible to perform clustering with good precision by setting the threshold of the distance from the cluster center which is permitted as the same cluster at a threshold which is determined based on the weighting.
Here, in Applied Example 2, description was given of the clustering unit 14A which performs batch clustering processing by inputting weighted representative points and changing the objective function from Formula (3) to Formula (18). That is, the clustering unit 14A allocates the respective representative points to the nearest clusters, and when the distance between an input weighted representative point and the nearest cluster center is λ/√w or more, a new cluster is generated. Here, w is the weight of the input representative point, and λ is a parameter which determines the cluster particle size. However, the clustering unit 14A is not limited thereto, and two near weighted representative points may be clustered (integrated).
Therefore, in Applied Example 3, a case is described in which near weighted representative points are clustered (integrated) in clustering in which weighted representative points are used.
Information Processing Device Configuration According to Applied Example 3
The clustering unit 14B selects and integrates two representative points from among a plurality of representative points. For example, the clustering unit 14B selects two representative points from among the plurality of representative points by using the cost function, and integrates the two representative points in a case where the condition under which the two representative points are integrated is matched. Here, the cost function is a function which calculates a degree of improvement in a case in which the two representative points are assumed to be integrated. The cost function and the condition under which the two representative points are integrated will be described in detail later. In addition, it is possible to realize the process in which two targets are selected and integrated by using a Merge algorithm. As the Merge algorithm, for example, a technique described in J. Lee et al., "Online video segmentation by Bayesian split-merge clustering", in ECCV 2012, may be used.
Here, a data structure of the cost function table 23 is described with reference to
The data structure of the representative point/cluster correspondence table 24 is described with reference to
The data structure of the cluster center set 25 is described with reference to
Clustering Processing Flow Example
As illustrated in the left view of
As illustrated in the center view of
As illustrated in the right view in
Then, the clustering unit 14B selects the two representative points having a maximum degree of improvement in the cost function by using the recalculated results, and continues the clustering processing until the selected two representative points do not satisfy the integration conditions.
Representative Points Integration Conditions
Here, the condition under which the representative points are integrated is described with reference to
As illustrated in
Here, in a case where the representative points are weighted, an expectation value of the objective function is calculated, according to Formula (18), by using a product of a weight value and an expectation value of distance with respect to the representative points. That is, the expectation value of the objective function in the case where the two representative points are integrated into one is calculated by using a product of the sum of the weights of the two representative points and an expectation value of distance between a data point belonging to the two representative points and the representative point of the integration destination, and a term calculated from the parameter determining the cluster particle size. The expectation value of the objective function in the case where the two representative points are not integrated into one is calculated by using the sum of a first product and a second product, and a term calculated from the parameter determining the cluster particle size. Here, the first product is a product of the weight of one representative point (first representative point) and an expectation value of distance between a data point belonging to the first representative point and the first representative point, and the second product is a product of the weight of the other representative point (second representative point) and an expectation value of distance between a data point belonging to the second representative point and the second representative point.
Cost Function
A cost function in a case where representative points have a uniform distribution is represented, for example, in Formula (19) below.
SCOST(C1, C2) = w1×d1² + w2×d2² − λ²   (19)
Here, C1 and C2 are representative points, respectively. w1 is a weight of C1, and w2 is a weight of C2. d1 is the distance between C1 and the integration-destination representative point (the weighted center of gravity of C1 and C2) into which the representative points have been integrated, and d2 is the distance between C2 and that integration-destination representative point. λ is a parameter which determines the cluster particle size.
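Formula (19) can be evaluated directly (an illustrative Python fragment; the integration destination is taken as the weighted center of gravity of C1 and C2, and the function name is invented for the sketch):

```python
def merge_cost(c1, w1, c2, w2, lam):
    # Formula (19): S_COST(C1, C2) = w1*d1^2 + w2*d2^2 - lam^2, where the
    # integration destination is the weighted center of gravity of C1 and C2.
    g = [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(c1, c2)]
    d1_sq = sum((a - gi) ** 2 for a, gi in zip(c1, g))
    d2_sq = sum((b - gi) ** 2 for b, gi in zip(c2, g))
    return w1 * d1_sq + w2 * d2_sq - lam ** 2

# Two nearby points with small weights: the cost is negative, so
# integrating them improves the objective.
light = merge_cost([0.0, 0.0], 1.0, [1.0, 0.0], 1.0, lam=2.0)    # -3.5
# Heavier weights push the cost above zero, so they stay separate.
heavy = merge_cost([0.0, 0.0], 10.0, [1.0, 0.0], 10.0, lam=2.0)  # 1.0
print(light, heavy)
```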
Derivation of the cost function which is represented by Formula (19) is described with reference to
As illustrated in
An expectation value k1 of error of a data point = (σ11²/12 + . . . + σ1n²/12 + . . . )   (20)
Here, in Formula (20), expectation values of error terms of the respective dimensions are added.
In the same manner, an expectation value of error of a data point belonging to the representative point C2 is represented by Formula (21) below.
An expectation value k2 of error of a data point = (σ21²/12 + . . . + σ2n²/12 + . . . )   (21)
Here, in Formula (21), expectation values of error terms of the respective dimensions are added.
A right-side view indicates a state after the representative points are integrated, and is a case of one representative point. C1 and C2 are the same as in the left-side view. σ′1n is the range between the representative point C1 and the integrated representative point, into which the representative points C1 and C2 have been integrated, in the nth dimension, and d1 is the distance between C1 and the integrated representative point. σ′2n is the range between the representative point C2 and the integrated representative point in the nth dimension, and d2 is the distance between C2 and the integrated representative point. Under such assumptions, an expectation value of error of dimension n of a data point belonging to the representative point C1 is represented by Formula (22).
(An expectation value of error of dimension n of a data point) = σ′1n² + σ1n²/12   (22)
Accordingly, an expectation value of error of a data point belonging to the representative point C1 is represented by Formula (23) below.
Expectation value g1 of the error of the data points = (σ′11² + . . . + σ′1n² + . . . ) + (σ11²/12 + . . . + σ1n²/12 + . . . )   (23)
Here, in Formula (23), expectation values of error terms of the respective dimensions are added, and (σ′11² + . . . + σ′1n² + . . . ) corresponds to d1², the square of the distance between the integrated representative point and C1.
In the same manner, an expectation value of error of a data point belonging to the representative point C2 is represented by Formula (24) below.
Expectation value g2 of the error of the data points = (σ′21² + . . . + σ′2n² + . . . ) + (σ21²/12 + . . . + σ2n²/12 + . . . )   (24)
Here, in Formula (24), expectation values of error terms of the respective dimensions are added, and (σ′21² + . . . + σ′2n² + . . . ) corresponds to d2², the square of the distance between the integrated representative point and C2.
By substituting Formula (23) and Formula (24) for an objective function of a case of weighted representative points which is indicated by Formula (18), an expectation value M1 of an objective function in the case of one representative point is represented by Formula (25) below.
M1 = w1×g1 + w2×g2 + λ² = w1×{d1² + (σ11²/12 + . . . + σ1n²/12 + . . . )} + w2×{d2² + (σ21²/12 + . . . + σ2n²/12 + . . . )} + λ²   (25)
By substituting Formula (20) and Formula (21) for an objective function of a case of weighted representative points which is indicated by Formula (18), an expectation value M2 of an objective function in the case of two representative points is represented by Formula (26) below.
M2 = w1×k1 + w2×k2 + 2λ² = w1×(σ11²/12 + . . . + σ1n²/12 + . . . ) + w2×(σ21²/12 + . . . + σ2n²/12 + . . . ) + 2λ²   (26)
Since the cost function SCOST(C1, C2) is obtained by subtracting the expectation value M2 of the objective function in the case of two representative points, which is represented by Formula (26), from the expectation value M1 of the objective function in the case of one representative point, which is represented by Formula (25), the cost function SCOST(C1, C2) is able to be represented by Formula (19) described above. That is, when the cost function SCOST(C1, C2) is 0 or less, it is preferable to integrate the two representative points C1 and C2.
In the above-mentioned examples, an expectation value of error of a data point is calculated by using a distance d between the integrated representative point and a representative point K. However, a method for obtaining an expectation value of error of a data point is not limited thereto. For example, information on representative points before integration may be stored, and a difference between a first distance between the representative point K and the representative point before integration, and a second distance between the representative point K and the integrated representative point, may be calculated as an error. A cost function SCOST(C1, C2) of a case where such an expectation value of error of a data point is used is represented by Formula (27) below.
Here, dold,K indicates the distance between the representative point K and the representative point (cluster center) before integration, and dnew,K indicates the distance between the representative point K and the integrated representative point (cluster center).
In this way, by utilizing information on representative points before and after integration when the representative points are integrated, as an expectation value of error of a data point, it is possible to prevent errors from being accumulated due to loss of the representative points before integration.
Clustering Processing Flow Chart
The clustering unit 14B calculates a cost function value SCOST in a case of integration of representative points xi and xj, and substitutes the calculated result in the cost function table 23 (step S101). Here, i and j are representative point IDs (representative point numbers). The clustering unit 14B substitutes the representative point group X in the cluster center set U (step S102). Then, the clustering unit 14B initializes each cluster ID value of the representative point/cluster correspondence table 24 by using the representative point ID (step S103).
Subsequently, the clustering unit 14B acquires a representative point ID pair (i, j) which minimizes a cost function value, from the cost function table 23 (step S104). The clustering unit 14B determines whether or not the cost function value (i, j) is 0 or more (step S105). In a case where the cost function value (i, j) is determined to be 0 or more (step S105: Yes), the clustering unit 14B transitions to step S110.
Meanwhile, in a case where the cost function value (i, j) is determined to be less than 0 (step S105: No), the clustering unit 14B changes the cluster ID at which the representative point ID is "j" to "i" within the representative point/cluster correspondence table 24 (step S106). Alternatively, the clustering unit 14B may change the cluster ID at which the representative point ID is "i" to "j".
Then, the clustering unit 14B updates the cluster center of gravity for a cluster whose cluster ID within the cluster center set 25 is “i”, to a weighted center of gravity of representative points whose cluster IDs are “i” in the representative point/cluster correspondence table 24 (step S107). The clustering unit 14B deletes a cluster (a representative point) whose cluster ID is “j” within the cluster center set 25 (step S108).
Then, the clustering unit 14B recalculates the cost function value SCOST from the updated cluster information, and updates the values of the cost function table 23 (step S109). Then, the clustering unit 14B transitions to step S104.
In step S110, the clustering unit 14B outputs the cluster center set 25 (step S110). Then, the clustering unit 14B ends clustering processing.
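Steps S101 to S110 reduce to the following greedy loop (a simplified sketch, not the device's implementation; the cost function table of the embodiment is replaced by recomputing Formula (19) each round, and the representative point/cluster correspondence table is omitted):

```python
def merge_clustering(reps, lam):
    """Greedy integration of weighted representative points (steps S101-S110).

    reps: list of (coordinates, weight). Repeatedly integrates the pair with
    the smallest cost value (Formula (19)) while that cost is negative.
    """
    clusters = [(list(x), w) for x, w in reps]              # S102, S103
    while len(clusters) > 1:
        best, best_cost = None, 0.0
        for i in range(len(clusters)):                      # S101/S109 costs
            for j in range(i + 1, len(clusters)):
                (c1, w1), (c2, w2) = clusters[i], clusters[j]
                g = [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(c1, c2)]
                d1_sq = sum((a - gi) ** 2 for a, gi in zip(c1, g))
                d2_sq = sum((b - gi) ** 2 for b, gi in zip(c2, g))
                cost = w1 * d1_sq + w2 * d2_sq - lam ** 2   # Formula (19)
                if cost < best_cost:                        # S104
                    best, best_cost = (i, j), cost
        if best is None:                                    # S105: Yes -> S110
            break
        i, j = best                                         # S106-S108
        (c1, w1), (c2, w2) = clusters[i], clusters[j]
        g = [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(c1, c2)]
        clusters[i] = (g, w1 + w2)                          # weighted centroid
        del clusters[j]
    return clusters                                         # S110

# Two nearby light points merge; the far point stays its own cluster.
reps = [([0.0, 0.0], 1.0), ([0.2, 0.0], 1.0), ([5.0, 0.0], 1.0)]
clusters = merge_clustering(reps, lam=1.0)
print(len(clusters))  # 2
```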
According to Applied Example 3, when clustering processing is performed using the representative points, the information processing device 1 selects two representative points, and in a case where an expectation value of the objective function of the case where the two representative points are integrated into one point is smaller than an expectation value of the objective function of the case where the two representative points are not integrated, the two representative points are integrated into one point. In this case, even in a case where representative points are utilized, the information processing device 1 is able to perform clustering without reducing precision.
In Applied Example 3, the clustering unit 14B integrates the two representative points in a case of matching the condition for integrating two representative points, and the integrated representative point into which the two representative points have been integrated is updated to the weighted center of gravity of the two representative points. However, the clustering unit 14B is not limited thereto, and the integrated representative point may be updated to either one of the two representative points in the case of matching the condition for integrating the two representative points.
Here, in a case of matching the condition for integrating two representative points, the clustering unit 14B according to Applied Example 4 integrates the two representative points by updating the integrated representative point to either one of the two representative points.
Information Processing Device Configuration According to Applied Example 4
Since the configuration of the information processing device 1 according to Applied Example 4 has the same configuration as the information processing device 1 according to Applied Example 3, the same reference numerals are used, and description of overlapping configuration will be omitted.
Clustering Processing Summary
Clustering Processing Flow Example
As illustrated in the upper stage left-side view of
As illustrated in an upper stage center view of
As illustrated in the upper stage right-side view in
As illustrated in the lower stage left-side view in
Cost Function
Here, according to Applied Example 4, a cost function in a case where representative points have a uniform distribution is represented, for example, in Formula (28) below.
SmCOST(C1,C2)=w2×d2−λ2 (28)
Here, C1 and C2 are respective representative points. w2 is a weight of C2. d is a distance between the representative point C1 and the representative point C2. λ is a parameter which determines a cluster particle size.
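Formula (28) can be evaluated directly; the following helper is a straightforward transcription (the function name s_mcost and the tuple representation of point coordinates are illustrative choices, not the notation of the embodiment):

```python
def s_mcost(c1, c2, w2, lam):
    """Cost function of Formula (28): SmCOST(C1, C2) = w2 * d^2 - lambda^2.

    c1, c2: coordinates of the representative points C1 and C2.
    w2: weight of the representative point C2.
    lam: the parameter lambda which determines the cluster particle size.
    """
    d2 = sum((a - b) ** 2 for a, b in zip(c1, c2))  # squared distance d^2
    return w2 * d2 - lam ** 2
```

A negative value indicates that integrating the two representative points is expected to lower the objective function.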
Derivation of the cost function which is represented by Formula (28) is described with reference to
As illustrated in
A right-side view indicates a state after the representative points are integrated, and is a case of one representative point. C1 and C2 are the same as those in the left-side view. σ′n is the range between the representative point C1 and the representative point C2 in the nth dimension, and d is the distance between the representative point C1 and the representative point C2. Under such assumptions, an expectation value of error of dimension n of a data point belonging to the representative point C2 is represented by Formula (29) below.
An expectation value of error of dimension n of a data point belonging to C2=σ′n2+σ2n2/12 (29)
Accordingly, an expectation value of error of a data point belonging to the representative point C2 is represented by Formula (30) below.
Expectation value g3 of error of a data point=(σ′12+ . . . +σ′n2+ . . . )+(σ212/12+ . . . +σ2n2/12+ . . . ) (30)
In Formula (30), the expectation values of the error terms of the respective dimensions are added. The first term (σ′12+ . . . +σ′n2+ . . . ) corresponds to d2, the square of the distance between the representative point C1 and the representative point C2.
An expectation value of error of a data point belonging to the representative point C1 is the same as that in Formula (20).
By substituting Formula (30) and Formula (20) into an objective function of the case of weighted representative points which are indicated by Formula (18), an expectation value M3 of an objective function in the case of one representative point is represented by Formula (31) below.
M3=w1×k1+w2×g3+λ2=w1×(σ112/12+ . . . +σ1n2/12+ . . . )+w2×{d2+(σ212/12+ . . . +σ2n2/12+ . . . )}+λ2 (31)
An expectation value of the objective function in the case of two representative points is the same as M2 which is represented by Formula (26).
Since the cost function SmCOST(C1, C2) is obtained by subtracting the expectation value M2 of the objective function in the case of two representative points, which is represented by Formula (26), from the expectation value M3 of the objective function in the case of one representative point, which is represented by Formula (31), the cost function SmCOST(C1, C2) is represented by the above-mentioned Formula (28). That is, when the cost function SmCOST(C1, C2) is 0 or less, it is preferable to integrate the two representative points C1 and C2.
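The subtraction M3 − M2 can be checked numerically. Since Formula (26) is not reproduced in this excerpt, the two-representative-point expectation below assumes that each representative point contributes its own uniform-distribution error term plus one λ2 penalty; under that assumption, the difference reduces exactly to w2×d2−λ2 of Formula (28):

```python
def expectation_one_point(w1, w2, var1, var2, d2, lam):
    """Formula (31): M3 = w1*sum(s1n^2/12) + w2*(d^2 + sum(s2n^2/12)) + lam^2.

    var1, var2: per-dimension squared ranges (sigma^2) of C1's and C2's
    uniform distributions; d2: squared distance between C1 and C2.
    """
    return (w1 * sum(v / 12 for v in var1)
            + w2 * (d2 + sum(v / 12 for v in var2)) + lam ** 2)

def expectation_two_points(w1, w2, var1, var2, lam):
    """Assumed form of Formula (26): each of the two representative points
    keeps its own uniform-distribution error term and contributes one lam^2
    penalty (this form is an assumption; Formula (26) is not shown here)."""
    return (w1 * sum(v / 12 for v in var1)
            + w2 * sum(v / 12 for v in var2) + 2 * lam ** 2)
```

Subtracting the second expression from the first cancels the shared error terms and one λ2 penalty, leaving w2×d2−λ2.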
Clustering Processing Flow Chart
The clustering unit 14B calculates a cost function value SmCOST in a case of integration of representative points xi and xj, and the calculated result is substituted into the cost function table 23 (step S121). Here, i and j are representative point IDs (representative point numbers). The clustering unit 14B substitutes the representative point group X into the cluster center set U (step S122). Then, the clustering unit 14B initializes each cluster ID value of the representative point/cluster correspondence table 24 using the representative point ID (step S123).
Subsequently, the clustering unit 14B acquires a pair of representative point IDs (i, j) minimizing the cost function value, from the cost function table 23 (step S124). The clustering unit 14B determines whether or not the cost function value (i, j) is 0 or more (step S125). In a case where the cost function value (i, j) is determined to be 0 or more (step S125: Yes), the clustering unit 14B transitions to step S129.
Meanwhile, in a case where the cost function value (i, j) is determined to be less than 0 (step S125: No), the clustering unit 14B changes the cluster ID associated with the representative point ID "j", to "i" within the representative point/cluster correspondence table 24 (step S126). Alternatively, the clustering unit 14B may change the cluster ID associated with the representative point ID "i", to "j".
Then, the clustering unit 14B deletes a cluster whose cluster ID is "j" within the cluster center set 25 (step S127). The clustering unit 14B updates the value that is associated with each pair of representative point IDs including "j" in the cost function table 23, to "∞" so that the deleted representative point is not selected again (step S128). Then, the clustering unit 14B transitions to step S124.
In step S129, the clustering unit 14B determines whether or not integration has occurred one time or more (step S129). In a case where it is determined that integration has occurred one time or more (step S129: Yes), the clustering unit 14B performs the following processing on a cluster corresponding to representative points for which the integration has occurred. That is, the clustering unit 14B updates a cluster center of a cluster within the cluster center set 25 to a weighted center of gravity of representative points which are assigned to the cluster in the representative point/cluster correspondence table 24 (step S130).
Then, the clustering unit 14B recalculates the cost function value SmCOST from the updated cluster information, and updates the values of the cost function table 23 (step S131). Then, the clustering unit 14B transitions to step S124.
Meanwhile, in a case where it is determined that integration has not occurred (step S129: No), the clustering unit 14B outputs the cluster center set 25 (step S132). Then, the clustering unit 14B ends clustering processing.
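The flow of steps S121 through S132 can be sketched as follows. This is an illustrative reading, not the embodiment's code: the cluster center set 25, the representative point/cluster correspondence table 24, and the cost function table 23 are represented by plain arrays and lists, and deleted entries are simply marked dead rather than set to ∞.

```python
import numpy as np

def cluster_merge_fixed(points, weights, lam, max_rounds=100):
    """Sketch of the Applied Example 4 flow (steps S121-S132).

    The integration destination keeps its coordinates while merges are being
    decided; weighted centers of gravity are recomputed only after the merge
    pass ends (step S130).
    """
    pts = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    centers = pts.copy()                      # step S122: U <- X
    cluster_of = list(range(len(pts)))        # step S123: cluster ID = point ID
    alive = [True] * len(pts)

    def cost(i, j):                           # Formula (28)
        d2 = float(np.sum((centers[i] - centers[j]) ** 2))
        return w[j] * d2 - lam ** 2

    for _ in range(max_rounds):
        merged = False
        while True:
            # step S124: pair of IDs (i, j) minimizing the cost function value
            pairs = [(cost(i, j), i, j)
                     for i in range(len(pts)) if alive[i]
                     for j in range(len(pts)) if alive[j] and j > i]
            if not pairs:
                break
            c, i, j = min(pairs)
            if c >= 0:                        # step S125: no pair worth merging
                break
            # step S126: reassign j's points to cluster i; coordinates unmoved
            cluster_of = [i if cid == j else cid for cid in cluster_of]
            alive[j] = False                  # steps S127-S128: delete cluster j
            w[i] += w[j]
            merged = True
        if not merged:                        # step S129: no integration occurred
            break
        # step S130: update each surviving center to the weighted center of
        # gravity of the representative points assigned to it
        for i in range(len(pts)):
            if alive[i]:
                members = [k for k, cid in enumerate(cluster_of) if cid == i]
                ws = np.array([weights[k] for k in members], dtype=float)
                centers[i] = (ws[:, None] * pts[members]).sum(axis=0) / ws.sum()
        # step S131: costs are recomputed from the updated centers next round
    return [centers[i] for i in range(len(pts)) if alive[i]]  # step S132
```

Note that both merges in a pass are decided against the original coordinates, matching the effect described in Applied Example 4.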
Applied Example 4 Effects
According to Applied Example 4, the information processing device 1 selects either one of the two representative points as the representative point of the integration destination and, when the integration operation ends, recalculates the coordinates of the representative point into which the two representative points have been integrated, by using information on the integrated representative points. In this case, the information processing device 1 does not move the coordinates of the representative point of the integration destination until the integration operation ends. As a result, the information processing device 1 is able to prevent representative points that should originally be integrated from failing to be integrated due to movement of the coordinates of representative points caused by the integration operation. In addition, even in a case where representative points are utilized, the information processing device 1 is able to perform clustering without reducing precision.
Here, objective function values, which are optimized using respective pieces of clustering processing that are described in Applied Examples 3 and 4, are calculated as below by experimentation. First, an objective function value which is optimized by clustering processing using the DP-means method is 7.33×1011, and the calculation time therefor is 321.1219 seconds. An objective function value which is optimized by clustering processing according to Applied Example 3 is 3.32×1011, and the calculation time therefor is 435.4864 seconds. An objective function value which is optimized by clustering processing according to Applied Example 4 is 3.24×1011, and the calculation time therefor is 434.2837 seconds.
As mentioned above, the objective function values which are optimized using the clustering processing in Applied Examples 3 and 4 are small in comparison to the other method. That is, the objective function value is further improved by performing the integration operation utilizing the Merge algorithm.
Others
Here, in Applied Example 1, the clustering unit 14 executes batch clustering processing by inputting weighted representative points within the grid list 21. At this time, it is described that the clustering unit 14 uses Formula (3) as the objective function. However, the clustering unit 14 is not limited thereto, and may use Formula (18) as the objective function.
In addition, in Applied Example 1, it is described that the representative point compression unit 13 sets a new representative point by merging another representative point with a representative point which is selected from among representative points included in the grid, so that the number of representative points included in the grid does not exceed the maximum number of held points. However, the representative point compression unit 13 is not limited thereto. Alternatively, nearby representative points may be selected, an average position of the selected nearby representative points may be calculated, and the average position may be set as a new representative point.
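The alternative compression mentioned above can be sketched as follows; the pairing rule (merge the nearest pair) and the use of a weighted rather than unweighted average are illustrative assumptions, not details of the embodiment:

```python
def compress_by_average(points, weights, max_points):
    """While the grid holds more than max_points representative points,
    replace the two nearest points with a single point at their weighted
    average position, so the count never exceeds the maximum held points."""
    pts = [list(map(float, p)) for p in points]
    ws = list(map(float, weights))
    while len(pts) > max_points:
        # find the nearest pair of representative points
        _, i, j = min((sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])), i, j)
                      for i in range(len(pts)) for j in range(i + 1, len(pts)))
        # set the (weighted) average position as the new representative point
        w = ws[i] + ws[j]
        pts[i] = [(ws[i] * a + ws[j] * b) / w for a, b in zip(pts[i], pts[j])]
        ws[i] = w
        del pts[j], ws[j]
    return pts, ws
```

The weighted average keeps the compressed point at the center of gravity of the data points it stands for.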
In addition, each configuration element of the illustrated information processing device 1 does not have to be physically configured as illustrated. That is, the specific mode of dispersion and integration of the information processing device 1 is not limited to the illustration, and it is possible to configure the entirety or a portion thereof so as to be functionally or physically dispersed or integrated in arbitrary units, depending on various loads, usage conditions, and the like. For example, the representative point update unit 12 and the representative point compression unit 13 may be integrated as one unit. In addition, the representative point update unit 12 may be dispersed to a generation unit which generates the grid, an addition unit which adds the data points to the grid, and a compression unit which causes the representative point compression unit 13 to compress the data. In addition, a storage unit 20 may be connected via a network, as an external device of the information processing device 1.
In addition, various processes which are described in the applied examples above are able to be realized by executing a program prepared in advance, by using a computer such as a personal computer or a work station. Therefore, an example of a computer which executes a data clustering program that realizes the same function as the information processing device 1 which is indicated in
As indicated in
For example, the drive device 213 is a device for a removable disk 210. The HDD 205 stores a data clustering program 205a and data clustering related information 205b.
The CPU 203 reads the data clustering program 205a, loads the same into the memory 201, and executes a process. The process corresponds to each functional unit of the information processing device 1. The data clustering related information 205b corresponds to the grid list 21 and the representative point list 22. Then, for example, a removable disk 211 stores each set of information such as the data clustering program 205a.
Here, the data clustering program 205a does not necessarily have to be initially stored in the HDD 205. For example, the program may be stored in a "portable physical medium" such as a floppy disk (FD), a CD-ROM, a DVD disc, an optical disc, or an IC card, which is inserted into the computer 200. Then, the computer 200 may be configured to read the data clustering program 205a therefrom to execute the same.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2015-113462 | Jun 2015 | JP | national |
2016-064660 | Mar 2016 | JP | national |