Embodiments described herein relate generally to a data compression device, a data compression method, and a computer program product.
A method of data compression is known in which point data constituting input time-series data is subjected to thinning. In regard to such methods of data compression, the Box Car algorithm, the Backward Slope algorithm, and the Swinging Door algorithm are available.
The Swinging Door algorithm is a representative example of the algorithms in which data thinning is achieved by performing linear approximation in such a way that the error is equal to or smaller than a predetermined threshold value. In the Swinging Door algorithm, a single point is fixed as the starting point, and linear approximation is performed from that starting point under the same error condition.
The types and the volume of time-series data stored in time-series databases continue to increase. In that regard, there has been a demand for a method which would enable achieving compression of time-series data in a more efficient manner.
According to an embodiment, a data compression device includes a receiving unit, a generating unit, a selecting unit, and a compressing unit. The receiving unit receives a plurality of pieces of input data which is input in chronological order. The generating unit generates a plurality of starting point candidates which represents the data having an error within a threshold value with respect to starting point data. The starting point data is the input data input at a first timing. The selecting unit refers to the starting point candidates, end point data which is the input data input at a second timing, and intermediate data which is the input data input at a timing in between the first timing and the second timing; and selects, from among the starting point candidates, the starting point candidate which, as compared to the other starting point candidates, has a greater number of pieces of the intermediate data approximated using the starting point candidate and using the end point data in such a way that the error is within the threshold value. The compressing unit outputs the selected starting point candidate and the end point data as output data obtained by compressing the starting point data, the intermediate data, and the end point data.
Various embodiments are described below in detail with reference to the accompanying drawings.
As described above, according to the Swinging Door algorithm or the like, a single starting point is set and linear approximation is performed in order to compress time-series data. In a data compression device according to a first embodiment, a plurality of starting points (starting point candidates) is set, and time-series data is compressed using such a starting point candidate which enables achieving compression in a more efficient manner.
Herein, the explanation is given about the terms used in the embodiments.
Time-series data points to a series of values (a point data string) obtained by observing or measuring temporal changes of a particular phenomenon. Usually, time-series data is measured at predetermined time intervals. Examples of time-series data include share prices and sensor values of an in-plant installation. For example, regarding each of a number of devices constituting an in-plant installation, a series of values of humidity, a series of values of vibrations, or a series of values of a control setting can be said to form a single piece of time-series data.
A time-series database is created by compiling time-series data. As a time-series database, a large volume of time-series data is stored in chronological order in a memory of a computer or in an external memory device (a hard disk).
A data item which is the smallest unit of data storage is called a tag. A tag is made of a data value, a time stamp, a data status, and the like. The data that is to be collected has types such as operating data which is input from a control system, computational data which is obtained by implementing an online computation function, data which is manually input by an operator, and interface data which is input from other systems.
In a time-series database, generally, there are several thousands to several tens of thousands of tags. The data storage period for each tag ranges from one year to several years. The data collection cycle depends on the real-time property of the concerned system (such as an in-plant installation), but ranges from a few seconds to one minute as a rough indication.
If it is assumed that the collected data is stored without any modification, then a time-series database needs to have the database capacity of about 10 GB (gigabytes) to 10 TB (terabytes). If the database capacity is increased to that extent, then it is bound to cause deterioration in the retrieval performance.
In that regard, for example, in an in-plant installation, a technology of data compression is implemented by making use of the property that, during stable operations, the operating data undergoes only a small change. In an in-plant installation, it is estimated that the behavior of the original data can be understood by referring to compressed data having the compression ratio of 1:20.
In this way, since a time-series database needs a large-capacity memory area, there has been a demand for a method which would enable achieving compression of time-series data in a more efficient manner.
Given below is the explanation of a data compression device according to a first embodiment.
The receiving unit 101, the registering unit 110, and the searching unit 114 can be implemented, for example, by executing a program in a processor such as a CPU (Central Processing Unit), that is, can be implemented using software; or can be implemented using hardware such as an IC (Integrated Circuit); or can be implemented using a combination of software and hardware.
The memory unit 121 is used to store a variety of data. For example, the memory unit 121 is used to store time-series data that has been compressed by a compressing unit 113. The memory unit 121 can be configured with any commonly-used memory medium such as an HDD (Hard Disk Drive), an optical disk, a memory card, or a RAM (Random Access Memory).
The receiving unit 101 receives processing requests and data that are input from external devices such as client devices. A processing request points to, for example, a time-series data registration request or a time-series data search request. In the case of a registration request, the receiving unit 101 receives a plurality of pieces of input data (point data of time-series data) that is input in chronological order. Alternatively, the receiving unit 101 can also receive pieces of point data that are input in real time. For example, the receiving unit 101 stores the pieces of point data, which are input in real time, in the memory unit 121. Still alternatively, the receiving unit 101 can receive pieces of point data in chronological order from the time-series data stored in the memory unit 121. In the case of receiving input of the time-series data from the memory unit 121, the configuration can be such that the receiving unit 101 goes back in time to a particular timing serving as the starting point, that is, receives the pieces of point data in a sequential manner starting from an earlier timing.
The registering unit 110 performs, based on an allowable error, an operation (a compression operation) of thinning a piece of point data from a series of point data that is input, and registers the post-thinning point data as time-series data in the memory unit 121. Regarding an algorithm for thinning a piece of point data using starting point candidates and using other pieces of point data, it is possible to use any one of the conventional algorithms such as the Swinging Door algorithm. Meanwhile, the registering unit 110 includes a generating unit 111, a selecting unit 112, and the compressing unit 113.
The generating unit 111 generates a plurality of starting point candidates that represents data having an error within a predetermined threshold value with respect to starting point data, which is the point data at a particular timing (a first timing).
The selecting unit 112 selects, from among the starting point candidates, such a starting point candidate which enables achieving compression of the time-series data in a more efficient manner. For example, the selecting unit 112 selects, from among the starting point candidates, such a starting point candidate which has a greater number of pieces of point data (intermediate data) that is approximated, using the concerned starting point candidate and using end point data which is input at a different timing (a second timing) than the timing of the starting point data, in such a way that the error is within a predetermined threshold value.
Then, the compressing unit 113 outputs the selected starting point candidate and the end point data as a piece of post-compression time-series data (output data). The compressing unit 113 stores, for example, the post-compression time-series data in the memory unit 121 in a sequential manner. Alternatively, the compressing unit 113 can also store at once a plurality of pieces of post-compression time-series data in the memory unit 121.
The searching unit 114 searches for the time-series data that is stored in the memory unit 121. For example, when a start timing, an end timing, and a sampling interval are specified, the searching unit 114 searches, at the specified sampling interval, the time-series database for a point data series in a section from the start timing to the end timing. Since there are times when the registering unit 110 performs point data thinning, sometimes point data may not be retrieved at the specified sampling interval. In such a case, the searching unit 114 interpolates the point data using, for example, the linear interpolation method, which is one of the methods of performing interpolation between two points. If (xs, ys) represents the start point and (xe, ye) represents the end point, then, with respect to an arbitrary x present on the straight line joining the start point and the end point, the value of y can be obtained using Equation (1). Meanwhile, xe≠xs is satisfied.
y=ys+(x−xs)(ye−ys)/(xe−xs) (1)
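For illustration purposes only, Equation (1) can be sketched in Python as follows; the function name and the sample values are assumptions made for this sketch and are not part of the embodiments.

def linear_interpolate(xs, ys, xe, ye, x):
    # Equation (1): the value y at an arbitrary x on the straight line joining
    # the start point (xs, ys) and the end point (xe, ye); requires xe != xs.
    return ys + (x - xs) * (ye - ys) / (xe - xs)

# Example: with the start point (0, 10.0) and the end point (10, 20.0),
# the interpolated value at x = 4 is 10.0 + 4 * 10.0 / 10 = 14.0.
print(linear_interpolate(0, 10.0, 10, 20.0, 4))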
Given below is the explanation of a specific example of a data compression operation performed according to the first embodiment.
Assume that five pieces of point data P1<t1, v1>, P2<t2, v2>, P3<t3, v3>, P4<t4, v4>, and P5<t5, v5> are present. Moreover, t1<t2<t3<t4<t5 is satisfied.
Firstly, with respect to the value <t2, v2> of P2, the registering unit 110 obtains two pieces of point data P2′<t2, v2+α> and P2″<t2, v2−α> that have the largest allowable error at the timing t2. The upper limit slope US2 represents the tilt of the line segment from P1 to P2′, and can be obtained as US2=(v2+α−v1)/(t2−t1). The lower limit slope LS2 represents the tilt of the line segment from P1 to P2″, and can be obtained as LS2=(v2−α−v1)/(t2−t1).
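For illustration purposes only, the calculation of the upper limit slope and the lower limit slope can be sketched in Python as follows; the representation of point data as (timing, value) pairs, the parameter alpha for the allowable error α, and the function name are assumptions made for this sketch.

def limit_slopes(start, point, alpha):
    # start: a starting point such as P1 = (t1, v1); point: a later piece of
    # point data such as P2 = (t2, v2); alpha: the allowable error.
    ts, vs = start
    t, v = point
    upper = (v + alpha - vs) / (t - ts)   # e.g. US2 = (v2 + alpha - v1) / (t2 - t1)
    lower = (v - alpha - vs) / (t - ts)   # e.g. LS2 = (v2 - alpha - v1) / (t2 - t1)
    return upper, lower

# Example: US2 and LS2 from P1 = (1.0, 5.0) to P2 = (2.0, 5.4) with alpha = 0.2.
US2, LS2 = limit_slopes((1.0, 5.0), (2.0, 5.4), 0.2)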
If the upper limit slope US3 up to P3 is smaller than the upper limit slope US2 up to P2 and if the lower limit slope LS3 up to P3 is greater than the lower limit slope LS2 up to P2, then the older piece of point data P2 is subjected to thinning.
The registering unit 110 calculates, for example, the logical value "LS2>US3 ∨ LS3>US2". If that value is true, then the registering unit 110 determines that the allowable error range with respect to P2 does not overlap with the provisional allowable error range with respect to P3. However, if that value is false, then the registering unit 110 determines that the allowable error range with respect to P2 overlaps with the provisional allowable error range with respect to P3.
When the allowable error range with respect to P2 overlaps with the provisional allowable error range with respect to P3, the registering unit 110 updates the upper limit slope and the lower limit slope up to P3 in the following manner.
US3′=Min(US3, US2)
LS3′=Max(LS3, LS2)
With respect to P4 too, the registering unit 110 updates the upper limit slope and the lower limit slope in the following manner, and uses the updated values in the subsequent calculation.
US4′=Min(US4, US3)
LS4′=Max(LS4, LS3)
US4=US4′
LS4=LS4′
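For illustration purposes only, the overlap determination and the slope update described above can be sketched in Python as follows, under the same assumptions as in the earlier sketch; returning None corresponds to the case in which thinning stops.

def update_slopes(prev_upper, prev_lower, new_upper, new_lower):
    # prev_upper, prev_lower: the upper and lower limit slopes carried over from
    # the former next-point data (e.g. US2 and LS2); new_upper, new_lower: the
    # slopes calculated up to the next point data (e.g. US3 and LS3).
    if prev_lower > new_upper or new_lower > prev_upper:
        # "LS2>US3 or LS3>US2": the two allowable error ranges do not overlap,
        # so the next point data cannot be approximated and thinning stops.
        return None
    # Otherwise narrow the range, as in US3' = Min(US3, US2), LS3' = Max(LS3, LS2).
    return min(new_upper, prev_upper), max(new_lower, prev_lower)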
As far as the algorithm used in the method for compression is concerned, the registering unit 110 can implement either the first method or the second method. Alternatively, the registering unit 110 can implement any other algorithm too. Conventionally, any such algorithm is implemented by setting a single starting point. In contrast, the registering unit 110 sets a plurality of starting points (starting point candidates) and implements the abovementioned algorithm with respect to each of the starting point candidates.
If the number of starting points to be generated is set to three, then the generating unit 111 generates, for example, P1<t1, v1>, P1′<t1, v1+α>, and P1″<t1, v1−α> as the starting point candidates. If the number of starting points to be generated is set to N, then the generating unit 111 generates, for example, <t1, v1+α>, <t1, v1+α×(1−2/(N−1)×1)>, <t1, v1+α×(1−2/(N−1)×2)>, . . . , <t1, v1>, . . . , and <t1, v1−α> as the starting point candidates, that is, N values spaced evenly within the allowable error α. However, the method of generating the starting point candidates is not limited to this method. That is, as long as a value is within the range of the allowable error α centered on the starting point data, any piece of point data can be treated as a starting point candidate.
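For illustration purposes only, one possible way of generating the N starting point candidates is sketched below in Python; the even spacing within the allowable error is one concrete reading of the expression given above, and the function name and sample values are assumptions made for this sketch.

def generate_candidates(t1, v1, alpha, n):
    # Generate n starting point candidates <t1, v1 + alpha*(1 - 2k/(n - 1))> for
    # k = 0, ..., n-1, spaced evenly within the allowable error alpha (n >= 2);
    # the first and last candidates are <t1, v1 + alpha> and <t1, v1 - alpha>.
    return [(t1, v1 + alpha * (1.0 - 2.0 * k / (n - 1))) for k in range(n)]

# n = 3 yields (t1, v1 + alpha), (t1, v1), and (t1, v1 - alpha), which match
# P1', P1, and P1'' in the example above.
print(generate_candidates(0.0, 5.0, 0.2, 3))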
Explained below is a comparison, with reference to the corresponding drawings, between the case in which compression is performed with a single starting point and the case in which compression is performed with a plurality of starting point candidates.
In this way, in the first and second methods for compression, only a single point serves as the starting point at t1. In contrast, in the first embodiment, a plurality of starting point candidates is set, and the thinning calculation is performed in parallel while treating each starting point candidate as the starting point. For that reason, in the examples given above, when only a single starting point is present, thinning is possible only up to P4 at a maximum. In contrast, in the method according to the first embodiment, thinning can be performed up to P5. In this way, in the method according to the first embodiment, at the same allowable error, a higher compression ratio is achieved.
Explained below with reference to the corresponding flowchart is the data compression operation performed in the data compression device 100 according to the first embodiment.
Firstly, the selecting unit 112 selects the starting point data (Step S101). For example, when pieces of time-series data are input in real time, the selecting unit 112 can select, as the starting point data, either the piece of point data that is input at the start or the piece of point data that is input after completion of a thinning operation with respect to already-input pieces of point data. Alternatively, when pieces of point data are sequentially input from already-stored time-series data, the selecting unit 112 can select, as the starting point data, either the piece of point data that is input at the start or the piece of point data that is input after completion of a thinning operation with respect to already-input pieces of point data.
Then, the generating unit 111 generates a plurality of starting point candidates that has an error within the allowable error with respect to the selected starting point data (Step S102).
Subsequently, the selecting unit 112 selects the next point data (Step S103). Herein, the next point data indicates a piece of point data that, with the timing of input of the starting point data (the first timing) serving as the reference timing, is sequentially input at each successive timing (a second timing). The next point data is selected by sequentially shifting the timing until thinning cannot be performed any more. In the following explanation, the next point data selected at the previous timing is called the former next-point data. Thus, the former next-point data at the timing at which thinning cannot be performed any more is equivalent to the end point data. The pieces of point data selected prior to that final former next-point data are equivalent to the intermediate data input in between the starting point data and the end point data.
As described above, the timing of input of the next point data can be a timing before or after the timing of input of the starting point data. Moreover, for example, when the pieces of point data are sequentially input from the already-stored time-series data, there can be a situation in which the starting point data selected at Step S101 becomes the last piece of point data, and the next point data cannot be selected (obtained). In such a case, although not illustrated in the flowchart, the data compression operation can be ended at that point.
Subsequently, the selecting unit 112 selects a single starting point candidate from among the starting point candidates that are generated (Step S104). Then, the selecting unit 112 determines whether or not the selected starting point candidate has been disabled (Step S105). Herein, disabling means exempting from the subsequent operations such a starting point candidate at which thinning cannot be performed using the next point data that has been selected. For example, a starting point candidate at which thinning could not be performed during operations for the former next-point data is disabled while processing the former next-point data (Step S109 described later). In this way, at Step S105, it is determined whether the selected starting point candidate has been disabled during the operations performed till the previous step.
If the selected starting point candidate has been disabled (Yes at Step S105), then the selecting unit 112 returns to the operation at Step S104, selects the next starting point candidate, and repeats the operations. On the other hand, if the selected starting point candidate has not been disabled (No at Step S105), then the selecting unit 112 calculates the upper limit slope and the lower limit slope from the selected starting point candidate up to the next point data (Step S106). Then, the selecting unit 112 compares the calculated upper limit slope and the calculated lower limit slope with the upper limit slope and the lower limit slope calculated with respect to the former next-point data (Step S107). For example, the selecting unit 112 determines whether or not the allowable error range identified by the upper limit slope and the lower limit slope of the former next-point data overlaps with the allowable error range identified by the upper limit slope and the lower limit slope of the next point data.
Thus, the selecting unit 112 determines whether or not the two allowable error ranges overlap with each other (Step S108). If the two allowable error ranges do not overlap with each other (No at Step S108), the selecting unit 112 disables the currently-selected starting point candidate (Step S109). Then, the selecting unit 112 returns to the operation at Step S104. On the other hand, if the two allowable error ranges overlap with each other (Yes at Step S108), then the selecting unit 112 updates the upper limit slope and the lower limit slope from the starting point candidate with the upper limit slope and the lower limit slope calculated with respect to the existing next point data (Step S110).
Subsequently, the selecting unit 112 determines whether or not all starting point candidates have been processed (Step S111). If all starting point candidates are yet to be processed (No at Step S111), the selecting unit 112 returns to the operation at Step S104 and repeats the operations. When all starting point candidates are processed (Yes at Step S111), the selecting unit 112 determines whether or not all starting point candidates have been disabled (Step S112). If all starting point candidates are yet to be disabled (No at Step S112), then the selecting unit 112 selects the point data at the next successive timing as the new next point data, and repeats the operations (Step S103).
When all starting point candidates are disabled (Yes at Step S112), the selecting unit 112 selects the starting point candidate that is disabled in the last instance (Step S113). As a result of such operations, the selecting unit 112 becomes able to select the starting point candidate at which a greater number of pieces of point data (intermediate data) are approximated to have the error within the allowable error.
Meanwhile, if there is a plurality of starting point candidates that is disabled in the last instance, the selecting unit 112 selects one of those starting point candidates. Herein, from among the plurality of starting point candidates that is disabled in the last instance, the selecting unit 112 selects the starting point candidate having the value closest to the starting point data.
Depending on the selected starting point candidate, the compressing unit 113 performs post-processing for the purpose of correcting the value of the end point data (the former next-point data) (Step S114). Alternatively, the configuration can be such that the end point data is output without performing post-processing.
Returning to the explanation of the flowchart, the compressing unit 113 then outputs the selected starting point candidate and the end point data (the former next-point data) as the post-compression data, and stores that data in the memory unit 121 (Step S115).
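For illustration purposes only, the sequence from Step S101 to Step S113 can be sketched in Python as follows. The representation of point data as (timing, value) pairs, the parameter alpha for the allowable error, and all function and variable names are assumptions made for this sketch; the post-processing at Step S114 and the storage at Step S115 are omitted.

def select_candidate(points, alpha, n_candidates=3):
    # points: chronologically ordered list of (timing, value) pairs, with
    # points[0] being the starting point data; requires n_candidates >= 2.
    t1, v1 = points[0]
    candidates = [(t1, v1 + alpha * (1.0 - 2.0 * k / (n_candidates - 1)))
                  for k in range(n_candidates)]                      # Step S102
    bands = {c: (float("inf"), float("-inf")) for c in candidates}   # (upper, lower)
    disabled = {c: False for c in candidates}
    last_disabled = [candidates[0]]     # fallback when no point can be thinned
    end_point = points[0]

    for t, v in points[1:]:                                          # Step S103
        newly_disabled = []
        for c in candidates:                                         # Step S104
            if disabled[c]:                                          # Step S105
                continue
            ts, vs = c
            upper = (v + alpha - vs) / (t - ts)                      # Step S106
            lower = (v - alpha - vs) / (t - ts)
            cur_upper, cur_lower = bands[c]
            if cur_lower > upper or lower > cur_upper:               # Steps S107, S108
                disabled[c] = True                                   # Step S109
                newly_disabled.append(c)
            else:
                bands[c] = (min(upper, cur_upper), max(lower, cur_lower))  # Step S110
        if newly_disabled:
            last_disabled = newly_disabled
        if all(disabled.values()):                                   # Step S112
            break
        end_point = (t, v)   # this point can still be approximated by some candidate

    # Step S113: if the input was exhausted before every candidate was disabled,
    # prefer a candidate that is still enabled; otherwise take the candidates
    # disabled in the last instance, breaking ties by closeness to v1.
    enabled = [c for c in candidates if not disabled[c]]
    pool = enabled if enabled else last_disabled
    chosen = min(pool, key=lambda c: abs(c[1] - v1))
    return chosen, end_point

In this sketch, the pair of values maintained in bands plays the role of the upper limit slope and the lower limit slope that are updated at Step S110 for each starting point candidate.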
In this way, in the data compression according to the first embodiment, a plurality of starting point candidates is set, and thinning calculation is performed while treating each starting point candidate as the starting point. Then, a starting point candidate is selected at which a greater volume of data can be subjected to thinning, and the data subjected to thinning using the selected starting point candidate is output as the resultant data of compression. As a result, it becomes possible to enhance the compression ratio of the time-series data.
As a result of implementing the method according to the first embodiment, although an enhancement in the compression ratio is achieved, there also occurs an increase in the amount of calculation because of the thinning calculation performed in parallel among a plurality of starting point candidates. In that regard, in a data compression device according to a second embodiment, an operation (filtering) for skipping the thinning calculation is also performed.
In the second embodiment, a selecting unit 112-2 of a registering unit 110-2 has different functions than the selecting unit 112 according to the first embodiment. Apart from that, the configuration and the functions are identical to the data compression device 100 according to the first embodiment.
The selecting unit 112-2 not only has the functions of the selecting unit 112 but also has an additional function of filtering. The selecting unit 112-2 determines, prior to performing operations with respect to each starting point candidate, whether a range that is approximated to be within the allowable error in the former next-point data and a range that is approximated to be within the allowable error in the next point data satisfy a predetermined condition. If the condition is satisfied, then the selecting unit 112-2 determines that approximation cannot be done to be within the allowable error in the next point data, and does not perform a determination operation with respect to the starting point candidates.
For example, the selecting unit 112-2 compares the minimum lower limit slope and the maximum upper limit slope of the former next-point data with the minimum lower limit slope and the maximum upper limit slope of the (existing) next point data, and determines whether a predetermined condition is satisfied. Then, the selecting unit 112-2 obtains a determination value (such as true or false) indicating whether or not the condition is satisfied, and skips operations with respect to the starting point candidates according to the determination value.
The minimum lower limit slope represents the minimum value from among the slopes between the starting point candidates and a value obtained by subtracting the allowable error from the point data. The maximum upper limit slope represents the maximum value from among the slopes between the starting point candidates and a value obtained by adding the allowable error to the point data.
With respect to P5, there are three upper limit slopes (referred to as US5, US5′, and US5″) having the starting points P1, P1′, and P1″, respectively. The maximum upper limit slope MaxUS indicates the maximum value from among the three upper limit slopes.
MaxUS=Max(US5, US5′, US5″)
In an identical manner, with respect to P5, there are three lower limit slopes (referred to as LS5, LS5′, and LS5″) having the starting points P1, P1′, and P1″, respectively. The minimum lower limit slope MinLS indicates the minimum value from among the three lower limit slopes.
MinLS=Min(LS5, LS5′, LS5″)
With respect to P4 too, the maximum upper limit slope and the minimum lower limit slope can be obtained in the following manner.
MaxUS=Max(US4, US4′, US4″)
MinLS=Min(LS4, LS4′, LS4″)
Herein, regarding P4 (the former next-point data), it is assumed that MinLS4 represents the minimum lower limit slope and MaxUS4 represents the maximum upper limit slope. Moreover, regarding P5 (the next point data), it is assumed that MinLS5 represents the minimum lower limit slope and MaxUS5 represents the maximum upper limit slope. The selecting unit 112-2 compares MinLS4, MaxUS4, MinLS5, and MaxUS5 according to the following condition, and calculates a determination value that indicates whether or not the condition is satisfied.
"MaxUS4<MinLS5" ∨ "MinLS4>MaxUS5"
This condition indicates that "either the maximum upper limit slope at P4 is smaller than the minimum lower limit slope at P5 or the minimum lower limit slope at P4 is greater than the maximum upper limit slope at P5". In such a case, it is clear that the allowable error range with respect to P4 does not overlap with the allowable error range with respect to P5. For that reason, it becomes possible to skip the operation of calculating the slope for each starting point candidate, and to continue with the operations assuming that all starting point candidates have been disabled. That is, it becomes possible to reduce the amount of calculation by avoiding unnecessary calculation.
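For illustration purposes only, the filtering determination can be sketched in Python as follows; it assumes that the former next-point data and the next point data are input at later timings than the starting point candidates (so that the slope denominators are positive), and the function name and argument layout are hypothetical.

def can_skip(candidates, former_point, next_point, alpha):
    # candidates: list of (t1, value) starting point candidates sharing the
    # timing t1; former_point and next_point: the former next-point data
    # (e.g. P4) and the next point data (e.g. P5) as (timing, value) pairs.
    t1 = candidates[0][0]
    v_min = min(v for _, v in candidates)
    v_max = max(v for _, v in candidates)

    def max_upper(point):        # maximum upper limit slope, e.g. MaxUS4
        t, v = point
        return (v + alpha - v_min) / (t - t1)

    def min_lower(point):        # minimum lower limit slope, e.g. MinLS4
        t, v = point
        return (v - alpha - v_max) / (t - t1)

    # "MaxUS4 < MinLS5" or "MinLS4 > MaxUS5": no candidate's allowable error
    # range at the former next-point data can overlap with that at the next
    # point data, so all candidates can be treated as disabled at once.
    return (max_upper(former_point) < min_lower(next_point)
            or min_lower(former_point) > max_upper(next_point))

When can_skip returns True, the per-candidate thinning calculation is skipped and all starting point candidates are treated as disabled, which is the reduction in the amount of calculation described above.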
Explained below with reference to the corresponding flowchart is the data compression operation performed in the data compression device according to the second embodiment.
The operations performed from Step S201 to Step S203 are identical to the operations performed from Step S101 to Step S103 in the data compression device 100 according to the first embodiment. Hence, that explanation is not repeated.
In the second embodiment, the selecting unit 112-2 calculates the determination value mentioned above (Step S204). Then, the selecting unit 112-2 determines whether or not the determination value is true (Step S205). If the determination value is false (No at Step S205), then the selecting unit 112-2 performs operations with respect to each starting point candidate (Step S206 to Step S214). Herein, the operations performed from Step S206 to Step S214 are identical to the operations performed from Step S104 to Step S112 according to the first embodiment. Hence, that explanation is not repeated.
On the other hand, if the determination value is true (Yes at Step S205), then the selecting unit 112-2 proceeds to the operation at Step S215 without performing the operations from Step S206 to Step S214. The operations performed from Step S215 to Step S217 are identical to the operations performed from Step S113 to Step S115 according to the first embodiment. Hence, that explanation is not repeated.
In this way, in the data compression device according to the second embodiment, filtering is additionally performed with the aim of skipping the thinning calculation. As a result, it becomes possible to hold down the increase in the amount of calculation attributed to the use of a plurality of starting points.
Explained below with reference to the corresponding drawing is a hardware configuration of the data compression device according to the first embodiment and the second embodiment.
The data compression device according to the first embodiment or the second embodiment includes a control device such as a CPU (Central Processing Unit) 51; memory devices such as a ROM (Read Only Memory) 52 and a RAM (Random Access Memory) 53; a communication I/F 54 that performs communication by establishing connection with a network; and a bus 61 that interconnects the other constituent elements.
Meanwhile, a data compression program executed in the data compression device according to the first embodiment or the second embodiment is stored in advance in the ROM 52 or the like.
Alternatively, the data compression program executed in the data compression device according to the first embodiment or the second embodiment can be recorded in the form of an installable or an executable file in a computer-readable recording medium such as a CD-ROM (Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (Compact Disk Recordable), or a DVD (Digital Versatile Disk).
Still alternatively, the data compression program executed in the data compression device according to the first embodiment or the second embodiment can be saved as a downloadable file on a computer connected to the Internet or can be made available for distribution through a network such as the Internet.
The data compression program executed in the data compression device according to the first embodiment or the second embodiment can cause a computer to function as each constituent element of the data compression device described above. In that computer, the CPU 51 can read the data compression program from a computer-readable recording medium into a main memory device and then execute the data compression program.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
This application is a continuation of PCT international application Ser. No. PCT/JP2013/052245 filed on Jan. 31, 2013 which designates the United States, the entire contents of which are incorporated herein by reference.