This application is a National Stage of International Application No. PCT/JP2015/001101 filed Mar. 3, 2015, claiming priority based on Japanese Patent Application No. 2014-054451, filed Mar. 18, 2014, the contents of all of which are incorporated herein by reference in their entirety.
The present invention relates to an information processing device and a clustering method.
Processing for analyzing large amounts of data of a plurality of types is performed on a wide variety of systems today. For example, combinations of data of types that are highly related to one another are extracted from among data of a plurality of types, and the extracted combinations of data are used to perform statistical processing or prediction processing. If the data to be analyzed contains data having different characteristics, the accuracy of the analytical processing decreases or the analysis becomes impossible.
Consider, for example, analysis of the relationship between an input packet rate and a central processing unit (CPU) utilization rate in a computer system by using an approximate line obtained by a least-square method or the like. If the computer system performs operations that are different between day and night, such as performing business processing during the daytime and batch processing during the night-time, there will be a significant difference in CPU utilization rate with respect to the input packet rate between day and night. In this case, an approximate line obtained from the mixture of daytime data and nighttime data is likely to be unfit for actual operation of the system.
Such analytical processing therefore requires classification (clustering) of data to be analyzed into clusters each of which includes data having the same characteristic by taking into consideration the characteristic of the data in advance.
A technique relating to such clustering in analytical processing is disclosed in PTL1, for example, which is a capacity management support apparatus that calculates a distribution density function for data combinations of particular types to classify data to be analyzed. NPL1 also discloses a technique that uses cross validation or Bayesian estimation to extract combinations that are in a close relation among data of a plurality of types and classify the data.
A related technique is disclosed in PTL2 which is an operation management apparatus that predicts an item of performance information concerning a system from another item of performance information on the basis of a correlation model of the system. PTL3 discloses another related technique which is an image data classifying apparatus that classifies image data on the basis of a plurality of types of distance definitions.
However, the techniques disclosed in PTL1 and NPL1 require calculation of distribution density functions for the data to be analyzed, or analysis using cross validation or Bayesian estimation. Accordingly, these techniques have a problem in that classifying data takes a long time because of the high processing load.
An object of the present invention is to solve the problem described above and provide an information processing device and a clustering method which provide fast classification of data according to characteristics.
An information processing device according to an exemplary aspect of the invention includes: a data storage means for storing a plurality of data sets; and a cluster generation means for generating an approximate line that approximates as many data sets as possible within a predetermined margin of error among the plurality of data sets in a space in which the plurality of data sets are arranged in accordance with data values, and generating a cluster by classifying the plurality of data sets based on the generated approximate line and outputting the generated cluster.
A clustering method according to an exemplary aspect of the invention includes: storing a plurality of data sets; and generating an approximate line that approximates as many data sets as possible within a predetermined margin of error among the plurality of data sets in a space in which the plurality of data sets are arranged in accordance with data values, and generating a cluster by classifying the plurality of data sets based on the generated approximate line and outputting the generated cluster.
A computer readable storage medium according to an exemplary aspect of the invention records thereon a program, causing a computer to perform a method including: storing a plurality of data sets; and generating an approximate line that approximates as many data sets as possible within a predetermined margin of error among the plurality of data sets in a space in which the plurality of data sets are arranged in accordance with data values, and generating a cluster by classifying the plurality of data sets based on the generated approximate line and outputting the generated cluster.
An advantageous effect of the present invention is that fast classification of data can be performed according to characteristics.
A configuration of an exemplary embodiment of the present invention will be described first.
The clustering device 100 includes a data input unit 200, a data storage unit 300, a cluster generation unit 400, and a cluster information storage unit 500.
The data input unit 200 receives, from a user or the like, inputs of data sequences of a plurality of types that are to be analyzed. In the exemplary embodiment of the present invention, data time sequences relating to the performance of a computer system, such as an input packet rate and CPU utilization rate in the computer system, are used as data sequences.
The data storage unit 300 stores data sequences of a plurality of types.
The cluster generation unit 400 classifies combinations of data (data combinations) of two types among a plurality of types according to characteristics to generate clusters.
The cluster generation unit 400 includes a data extracting unit 410, a data arranging unit 420, an approximate line generation unit 430, a cluster registering unit 440 and a cluster information output unit 450.
The data extracting unit 410 extracts a plurality of data combinations of data sequences of two types from among a plurality of types.
The data arranging unit 420 arranges the extracted data combinations in a two-dimensional space to generate a data combination image 421.
The approximate line generation unit 430 generates an approximate line that approximates as many data combinations as possible among a plurality of data combinations within a predetermined margin of error on the basis of the data combination image 421.
The cluster registering unit 440 generates a cluster based on the generated approximate line and registers the generated cluster in cluster information 501.
The cluster information output unit 450 outputs the cluster information 501 to a user or the like.
The cluster information storage unit 500 stores the cluster information 501.
Note that the clustering device 100 may be a computer that includes a CPU and a storage medium on which a program is stored, and that operates under the control of the program. In this case, the CPU of the clustering device 100 executes a computer program for implementing the functions of the data input unit 200 and the cluster generation unit 400. The storage medium of the clustering device 100 stores information of the data storage unit 300 and the cluster information storage unit 500. The data storage unit 300 and the cluster information storage unit 500 may be implemented by separate storage media or a single storage medium.
Next, the operation of the exemplary embodiment of the present invention will be described.
It is assumed here that data sequences illustrated in
First, the data extracting unit 410 extracts data combinations of data sequences of two types stored in the data storage unit 300 (step S101). For example, the data extracting unit 410 extracts data combinations each of which is acquired at the same time instant. Note that the data extracting unit 410 may use a common attribute other than the time associated with data, such as a particular event, to extract data combinations.
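Step S101 can be sketched as follows. This is a minimal illustration only, assuming that each data sequence is held as a mapping from a time key to a value; the function name `extract_combinations` and the sample values are hypothetical and do not appear in the embodiment.

```python
from typing import Dict, List, Tuple


def extract_combinations(
    seq_x: Dict[int, float], seq_y: Dict[int, float]
) -> List[Tuple[int, float, float]]:
    """Pair the values of two data sequences that share the same time key."""
    # Only time instants present in both sequences yield a data combination.
    common_times = sorted(seq_x.keys() & seq_y.keys())
    return [(t, seq_x[t], seq_y[t]) for t in common_times]


# Hypothetical sample values keyed by hour of day (assumed for illustration).
input_rate = {0: 120.0, 1: 130.0, 2: 125.0, 3: 140.0}
cpu_util = {1: 22.0, 2: 21.0, 3: 24.0, 4: 30.0}

combos = extract_combinations(input_rate, cpu_util)
```

A common attribute other than time, such as an event identifier, could serve as the key in the same way.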
For example, as illustrated in
The data arranging unit 420 arranges the extracted data combinations in a two-dimensional space to generate a data combination image 421 (step S102). The data arranging unit 420 generates the data combination image 421 by arranging points representing the data combinations at positions corresponding to the values of data included in the data combinations in a space using each of the types of data sequences as a dimension.
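One possible way to realize step S102 is sketched below. The grid resolution, the min-max scaling, and the name `arrange` are assumptions made for illustration; the embodiment only requires that each point be placed at a position corresponding to its data values.

```python
from typing import Dict, List, Tuple

Point = Tuple[str, float, float]  # (identifier, x value, y value)


def arrange(
    points: List[Point], width: int = 64, height: int = 64
) -> Dict[Tuple[int, int], List[str]]:
    """Place each data combination into a cell of a width x height grid."""
    xs = [x for _, x, _ in points]
    ys = [y for _, _, y in points]
    x_min, y_min = min(xs), min(ys)
    # Guard against a zero span when all values on an axis are equal.
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0

    image: Dict[Tuple[int, int], List[str]] = {}
    for ident, x, y in points:
        col = min(int((x - x_min) / x_span * width), width - 1)
        row = min(int((y - y_min) / y_span * height), height - 1)
        # Keep identifiers per cell so that clustered points can be deleted later.
        image.setdefault((col, row), []).append(ident)
    return image


# Hypothetical points, for illustration only.
image = arrange([("s0", 0.0, 0.0), ("s1", 10.0, 50.0), ("s2", 5.0, 25.0)], 10, 10)
```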
For example, the data arranging unit 420 arranges the data combinations s0, s1, . . . , s23 in a two-dimensional space having the X (input) axis and Y (CPU utilization rate) axis to generate the data combination image 421 as illustrated in
The approximate line generation unit 430 generates an approximate line that approximates as many data combinations as possible among a plurality of data combinations within a predetermined margin of error in the data combination image 421 (step S103). Note that in the exemplary embodiment of the present invention, straight lines are used as approximate lines.
The approximate line generation unit 430 generates an approximate line by using, for example, the Hough transform, which is a technique for detecting lines in image processing. In the Hough transform, an approximate line that passes through points representing data combinations is represented in a polar coordinate space (θ, ρ). Here, θ is the angle between the X axis and the normal to the approximate line (straight line) and ρ is the distance from the origin to the approximate line. In the Hough transform, a quantized value of ρ is calculated while changing a quantized value of θ for each of the points, and a set of θ and ρ which is the same for as many points as possible is extracted by voting. In the Hough transform, an error between an approximate line represented by θ, ρ and each of the points approximated by the approximate line is dependent on quantization errors in θ and ρ. Accordingly, it can be considered that a quantization step size determines an error in the approximate line (a predetermined margin of error).
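The voting described above can be sketched as follows. This is a simplified illustration and not necessarily the exact procedure of the embodiment: it uses the normal form ρ = x cos θ + y sin θ, quantizes θ into `theta_steps` values over [0, π), and quantizes ρ with step `rho_step`. The function name, the parameter names, and the sample points are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


def hough_line(
    points: List[Tuple[float, float]],
    theta_steps: int = 180,
    rho_step: float = 1.0,
) -> Tuple[float, float, int]:
    """Return (theta, rho, votes) for the most-voted quantized line."""
    votes: Counter = Counter()
    for x, y in points:
        for i in range(theta_steps):
            theta = math.pi * i / theta_steps
            rho = x * math.cos(theta) + y * math.sin(theta)
            # Points whose rho falls into the same bin vote for the same line.
            votes[(i, round(rho / rho_step))] += 1
    (i, k), count = votes.most_common(1)[0]
    return math.pi * i / theta_steps, k * rho_step, count


# Five points on Y = X + 10 plus one outlier (hypothetical values).
points = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (5, 40)]
theta, rho, count = hough_line(points)
```

In this sketch, a larger `rho_step` or a smaller `theta_steps` widens the margin of error and reduces the number of bins to evaluate, which corresponds to the trade-off between the allowable error and the processing load.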
For example, the approximate line generation unit 430 generates a line L1 in the data combination image 421 in
Note that the approximate line generation unit 430 may use any method other than the Hough transform to generate an approximate line, as long as the method can generate an approximate line that approximates as many combinations as possible among a plurality of data combinations within a predetermined margin of error in the data combination image 421.
The cluster registering unit 440 extracts data combinations that exist in a predetermined range from the generated approximate line (step S104).
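Step S104 can be illustrated as below, assuming that the approximate line is given in the normal form (θ, ρ) and that the predetermined range is a perpendicular distance `width` on either side of the line. The name `within_width` and the sample values are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[str, float, float]  # (identifier, x value, y value)


def within_width(
    points: List[Point], theta: float, rho: float, width: float
) -> List[str]:
    """Return identifiers of points whose distance to the line is <= width."""
    c, s = math.cos(theta), math.sin(theta)
    # For a line x*cos(theta) + y*sin(theta) = rho, the perpendicular
    # distance of (x, y) is |x*cos(theta) + y*sin(theta) - rho|.
    return [ident for ident, x, y in points if abs(x * c + y * s - rho) <= width]


# Line Y = X corresponds to theta = 135 degrees and rho = 0 (illustrative data).
sample = [("s4", 1.0, 1.0), ("s5", 2.0, 2.5), ("s0", 0.0, 5.0)]
selected = within_width(sample, 3 * math.pi / 4, 0.0, 1.0)
```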
For example, the cluster registering unit 440 extracts data combinations s4 to s8 and s16 to s20 within a width W from the line L1 in the data combination image 421 in
The cluster registering unit 440 generates a cluster having the extracted data combinations as its elements and registers the cluster in the cluster information 501 (step S105). The cluster registering unit 440 registers the number of the extracted data combinations, parameters (slope, intercept) of the generated approximate line and the accuracy of the approximate line together with identifiers of the extracted data combinations. The accuracy of the approximate line can be calculated from the distribution of the extracted data combinations around the approximate line.
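The registration in step S105 might look like the following sketch. The record layout `Cluster` is hypothetical. The accuracy measure used here, 1 / (1 + RMS perpendicular residual), is only one assumed way of deriving an accuracy from the distribution of points around the line; the embodiment does not fix a formula.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[str, float, float]  # (identifier, x value, y value)


@dataclass
class Cluster:
    name: str
    members: List[str]
    slope: float
    intercept: float
    accuracy: float

    @property
    def count(self) -> int:
        return len(self.members)


def make_cluster(name: str, points: List[Point], theta: float, rho: float) -> Cluster:
    """Build a cluster record from the extracted points and the line (theta, rho)."""
    c, s = math.cos(theta), math.sin(theta)
    if abs(s) < 1e-12:
        raise ValueError("a vertical line has no slope-intercept form")
    # x*cos(theta) + y*sin(theta) = rho  <=>  y = -(cos/sin)*x + rho/sin
    slope, intercept = -c / s, rho / s
    residuals = [x * c + y * s - rho for _, x, y in points]
    rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    # Assumed accuracy measure: 1.0 when all points lie exactly on the line.
    return Cluster(name, [i for i, _, _ in points], slope, intercept, 1.0 / (1.0 + rms))


# Illustrative points lying on Y = X.
c1 = make_cluster(
    "c1", [("s4", 1.0, 1.0), ("s5", 2.0, 2.0), ("s6", 3.0, 3.0)], 3 * math.pi / 4, 0.0
)
```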
For example, the cluster registering unit 440 registers the number of combinations “10”, data combinations “s4 to s8, s16 to s20”, an approximate line “Y=a1X+b1”, and accuracy “d1”, in the piece of cluster information 501 for a cluster “c1”, as illustrated in
The cluster registering unit 440 deletes the data combinations registered in the cluster from the data combination image 421 (step S106).
The cluster registering unit 440 repeats the process from step S103 a predetermined number of times or until no data combinations remain (step S107).
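The loop formed by steps S103 through S107 can be sketched as follows. To keep the example short and self-contained, the line search here is not the Hough transform but a simple stand-in that tries the line through every pair of points and keeps the one with the most points within a tolerance `tol`. All names and sample values are assumptions.

```python
import math
from itertools import combinations
from typing import List, Optional, Tuple

Point = Tuple[str, float, float]  # (identifier, x value, y value)
Line = Tuple[float, float, float]  # a*x + b*y + c = 0 with a^2 + b^2 = 1


def best_line(points: List[Point], tol: float) -> Optional[Tuple[Line, List[Point]]]:
    """Stand-in for step S103: the pairwise line covering the most points."""
    best = None
    for (_, x1, y1), (_, x2, y2) in combinations(points, 2):
        a, b = y2 - y1, x1 - x2
        norm = math.hypot(a, b)
        if norm == 0:
            continue  # coincident points do not define a line
        a, b = a / norm, b / norm
        c = -(a * x1 + b * y1)
        inliers = [p for p in points if abs(a * p[1] + b * p[2] + c) <= tol]
        if best is None or len(inliers) > len(best[1]):
            best = ((a, b, c), inliers)
    return best


def cluster_all(points: List[Point], tol: float, max_rounds: int = 10):
    """Repeat line search, registration, and deletion (steps S103 to S107)."""
    remaining = list(points)
    clusters = []
    for _ in range(max_rounds):
        if len(remaining) < 2:
            break  # this sketch needs at least two points to define a line
        found = best_line(remaining, tol)
        if found is None:
            break
        line, inliers = found
        ids = {p[0] for p in inliers}
        clusters.append((line, [p[0] for p in inliers]))  # step S105
        remaining = [p for p in remaining if p[0] not in ids]  # step S106
    return clusters, [p[0] for p in remaining]


# Hypothetical data: four points on Y = 2X and three points on Y = 10.
data = [("a0", 0, 0), ("a1", 1, 2), ("a2", 2, 4), ("a3", 3, 6),
        ("b0", 0, 10), ("b1", 1, 10), ("b2", 2, 10)]
clusters, leftover = cluster_all(data, tol=0.1)
```

Because the registered points are deleted before the next search, each later round finds the dominant line among the points that have not yet been classified.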
For example, the approximate line generation unit 430 generates a line L2 in the image obtained by the deletion of the data combinations s4 to s8, s16 to s20 from the data combination image 421 in
Furthermore, the approximate line generation unit 430 generates a line L3 in the image obtained by deletion of the data combinations s4 to s8, s16 to s20 and s3, s9 to s11, s14, s21 to s23 from the data combination image 421 in
The cluster information output unit 450 outputs the cluster information 501 to a user or the like (step S108). For example, the cluster information output unit 450 outputs the cluster information 501 through a displaying device (not depicted) such as a display.
Note that the cluster information output unit 450 may send the cluster information 501 to another device as a data file.
A user or the like can use the data combinations included in each cluster in the output cluster information 501 as a result of data classification.
Furthermore, an analyzing unit, not depicted, provided in the clustering device 100 or an analyzer or the like, not depicted, that is connected to the clustering device 100 may perform analytical processing such as prediction of a value of data of one type from a value of data of the other type by using the approximate line for each cluster.
For example, an analyzing unit, an analyzer or the like may predict a range of variations in the value of the CPU utilization rate with respect to a range of variations in the value of an input by using the approximate line of each cluster in the cluster information 501 in
In this case, the analyzing unit, the analyzer or the like makes predictions using the approximate line of a cluster specified by a user or the like, for example. For example, on the basis of the output cluster information 501, the user or the like considers a cluster that includes many combinations or a cluster that has an approximate line with a high degree of accuracy to be a reliable cluster and specifies that cluster. Alternatively, the user or the like considers clusters that have approximate lines with excessively large or small slopes to be noise and specifies a cluster that has an approximate line with a slope within a predetermined range.
The cluster information output unit 450 may rearrange pieces of information relating to clusters in the cluster information 501 in descending order of the number of combinations, the accuracy, or slope of approximate line and may output the rearranged pieces of information for allowing a user or the like to select a cluster. The cluster information output unit 450 may extract and output clusters that have numbers of combinations that are equal to or greater than a predetermined threshold or degrees of accuracy that are equal to or greater than a predetermined threshold from among the clusters included in the cluster information 501. Alternatively, the cluster information output unit 450 may extract and output clusters that have approximate lines with slopes within a predetermined range from among the clusters included in the cluster information 501.
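The rearrangement and extraction described above can be illustrated with the following sketch. The dictionary layout, the function names, and all numeric values are hypothetical and are not taken from the cluster information 501.

```python
from typing import Dict, List

Cluster = Dict[str, object]

# Hypothetical cluster records, for illustration only.
clusters: List[Cluster] = [
    {"name": "B", "count": 8, "slope": 0.3, "accuracy": 0.90},
    {"name": "C", "count": 3, "slope": 25.0, "accuracy": 0.40},
    {"name": "A", "count": 10, "slope": 0.8, "accuracy": 0.95},
]


def rank(items: List[Cluster], key: str) -> List[Cluster]:
    """Rearrange clusters in descending order of the given field."""
    return sorted(items, key=lambda c: c[key], reverse=True)


def select_reliable(
    items: List[Cluster], min_count: int, min_accuracy: float
) -> List[Cluster]:
    """Keep clusters whose count or accuracy reaches its threshold."""
    return [
        c for c in items
        if c["count"] >= min_count or c["accuracy"] >= min_accuracy
    ]


def select_by_slope(items: List[Cluster], low: float, high: float) -> List[Cluster]:
    """Keep clusters whose slope lies within [low, high]."""
    return [c for c in items if low <= c["slope"] <= high]


by_count = [c["name"] for c in rank(clusters, "count")]
reliable = [c["name"] for c in select_reliable(clusters, 5, 0.8)]
in_range = [c["name"] for c in select_by_slope(clusters, 0.1, 5.0)]
```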
Note that on behalf of a user or the like, an analyzing unit, an analyzer or the like may select a cluster that has many combinations, a cluster that has an approximate line with a high degree of accuracy, or a cluster that has an approximate line with a predetermined range of slope, and may make predictions.
Furthermore, an analyzing unit, an analyzer or the like may output the accuracy of an approximate line as the accuracy of prediction by an approximate line along with a result of prediction.
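A prediction of this kind can be sketched as follows, assuming a straight approximate line Y = aX + b and an accuracy value registered for the cluster. The function name `predict_range` and the numeric values are hypothetical.

```python
from typing import Tuple


def predict_range(
    slope: float, intercept: float, accuracy: float, x_low: float, x_high: float
) -> Tuple[float, float, float]:
    """Map an input range onto the line and attach the line's accuracy."""
    y1 = slope * x_low + intercept
    y2 = slope * x_high + intercept
    # A negative slope reverses the order of the end points.
    return min(y1, y2), max(y1, y2), accuracy


# Hypothetical line Y = 0.5X + 10 with accuracy 0.9, input range 100 to 200.
y_low, y_high, acc = predict_range(0.5, 10.0, 0.9, 100.0, 200.0)
```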
Moreover, an analyzing unit, an analyzer or the like may use attributes associated with clusters to group new pieces of data and may perform analytical processing of the classified pieces of data. For example, if a plurality of data combinations that belong to each cluster were acquired in the same time period, an analyzing unit, an analyzer or the like may make a correlation analysis of new data in the time period associated with each cluster as described in PTL2, for example.
In this case, an attribute to be associated with each cluster is specified by a user or the like, for example. The user or the like specifies an attribute that is common to the data combinations belonging to each cluster as the attribute to be associated with the cluster on the basis of output cluster information 501.
In order that a user or the like can identify the attribute that is common to the data combinations belonging to each cluster, the cluster information output unit 450 may output the cluster to which each data combination belongs in association with the attribute of the data combination.
Note that on behalf of a user or the like, an analyzing unit, an analyzer or the like may identify the attribute that is common to the data combinations belonging to each cluster, may associate the attribute with the cluster and may perform classification and statistical processing of new data.
When data sequences of a plurality of types are stored in the data storage unit 300, steps S101 through S108 may be repeated for each of different combinations of data sequences of two types.
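The repetition over pairs of data sequence types can be sketched as below. The callable `cluster_pair` stands for the whole of steps S101 through S108 and is supplied by the caller; it, the type name "memory", and the sample values are assumptions for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def cluster_every_pair(
    sequences: Dict[str, List[float]],
    cluster_pair: Callable[[List[float], List[float]], object],
) -> Dict[Tuple[str, str], object]:
    """Run the clustering once for each distinct pair of sequence types."""
    results: Dict[Tuple[str, str], object] = {}
    for name_a, name_b in combinations(sorted(sequences), 2):
        results[(name_a, name_b)] = cluster_pair(sequences[name_a], sequences[name_b])
    return results


# Hypothetical sequences; the stub below only counts the paired samples.
sequences = {
    "input": [120.0, 130.0, 125.0],
    "cpu": [20.0, 22.0, 21.0],
    "memory": [40.0, 41.0, 43.0],
}
results = cluster_every_pair(sequences, lambda a, b: min(len(a), len(b)))
```

For m types of data sequences this runs m(m-1)/2 times, so the cost of a single clustering pass directly bounds how many types can be handled exhaustively.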
Then the operation of the exemplary embodiment of the present invention ends.
Next, a characteristic configuration of the exemplary embodiment of the present invention will be described.
Referring to
The data storage unit 300 stores a plurality of data sets. The cluster generation unit 400 generates an approximate line that approximates as many data sets as possible within a predetermined margin of error among the plurality of data sets in a space in which the plurality of data sets are arranged in accordance with data values. The cluster generation unit 400 generates a cluster by classifying the plurality of data sets based on the generated approximate line and outputs the generated cluster.
According to the exemplary embodiment of the present invention, fast classification of data can be performed according to characteristics. This is because the cluster generation unit 400 generates an approximate line that approximates as many data combinations as possible within a predetermined margin of error in a space in which data combinations are arranged and generates a cluster based on the generated approximate line. The load of calculating an approximate line can be reduced, for example, by appropriately setting an allowable margin of error, such as quantization errors in the Hough transform. This can reduce the load of clustering and enables fast clustering as compared with an analysis that involves calculation of a distribution density function or an analysis that uses cross validation or Bayesian estimation as in the techniques of PTL1 and NPL1.
Furthermore, exhaustive and real-time clustering of data combinations of different types can be performed even when there are many types of data or a large number of data combinations.
According to the exemplary embodiment of the present invention, an approximation equation representing the relationship with respect to data combinations included in a cluster can be generated simultaneously with clustering. This is because the approximate line generation unit 430 generates an approximate line as described above.
While the invention has been particularly shown and described with reference to exemplary embodiments thereof, the invention is not limited to these embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.
For example, while straight lines are used as approximate lines in the exemplary embodiment of the present invention, lines that have any other shapes, such as circles, quadratic function curves, or logarithmic function curves, may be used as approximate lines, as long as the lines can approximate the relationship with respect to data combinations.
Furthermore, combinations of two types of data are arranged in a two-dimensional space to generate a data combination image 421 and approximate lines are generated on the data combination image 421 in the exemplary embodiment of the present invention. However, clusters for classifying combinations of n types of data (where n is an integer equal to or greater than 2) may be generated by extracting approximate lines from an n-dimensional space in which combinations of n types of data are arranged in a way similar to that described above. In this case, an analyzing unit, an analyzer or the like may use approximate lines to perform analytical processing such as prediction of values of data of n−1 types from values of data of one type among n types, for example.
For example, when a cluster is generated for a combination of three types of data, an analyzing unit, an analyzer or the like uses an approximate line for each cluster to predict a range of variations in values of data of two of the types that are used as outputs with respect to a range of variations in a value of data of the other type used as an input.
Furthermore, if a range of data values of a certain type varies from one approximate line to another, an analyzing unit, an analyzer or the like may narrow down approximate lines to be used in prediction in accordance with values of data of the type.
Data to be analyzed in the exemplary embodiment of the present invention is data concerning the performance of a computer system, such as the input packet rate and CPU utilization rate in the computer system. However, data to be analyzed may be any items of data that relate to each other, such as data acquired with various sensors, besides data concerning the performance of a computer.
This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2014-054451, filed on Mar. 18, 2014, the disclosure of which is incorporated herein in its entirety by reference.
Number | Date | Country | Kind |
---|---|---|---|
2014-054451 | Mar 2014 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2015/001101 | 3/3/2015 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2015/141157 | 9/24/2015 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
20090087104 | Nakate | Apr 2009 | A1 |
20120288149 | Kido | Nov 2012 | A1 |
20130063789 | Iwayama | Mar 2013 | A1 |
20140093177 | Hayashi | Apr 2014 | A1 |
Number | Date | Country |
---|---|---|
10-198789 | Jul 1998 | JP |
2003-242160 | Aug 2003 | JP |
2009-99120 | May 2009 | JP |
2011-133988 | Jul 2011 | JP |
2013-8289 | Jan 2013 | JP |
5141789 | Feb 2013 | JP |
2013128789 | Sep 2013 | WO |
Entry |
---|
Ryohei Fujimaki, et al., “The Most Advanced Data Mining of the Big Data Era”, NEC Technical Journal, NEC Corporation, Sep. 2013, pp. 81-85, vol. 65, No. 02/2012. |
Ryohei Fujimaki, et al., “The Most Advanced Data Mining of the Big Data Era”, NEC Technical Journal, NEC Corporation, Sep. 2013, pp. 91-95, vol. 7, No. 02/2012. |
International Search Report of PCT/JP2015/001101, dated May 26, 2015. [PCT/ISA/210]. |
Written Opinion of PCT/JP2015/001101, dated May 26, 2015. [PCT/ISA/237]. |
Number | Date | Country | |
---|---|---|---|
20170083605 A1 | Mar 2017 | US |