The present disclosure relates to a time-series data analysis device, a time-series data analysis method, and a non-transitory computer readable medium storing a time-series data analysis program for extracting a feature from time-series data.
When time-series data is analyzed and a feature of the time-series data is extracted, a slide window is used. To use the slide window, the length of the slide window (hereinafter sometimes referred to as the “window length”) must be specified. For example, Non-Patent Literature 1 discloses a technique of generating a distance matrix, in which the distances between time series subsequences of length m in time-series data are obtained using a slide window of length m, and extracting a feature called a matrix profile from the distance matrix.
Non-Patent Literature 1: Yeh, Chin-Chia Michael, et al. “Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets.” 2016 IEEE 16th international conference on data mining (ICDM). IEEE, 2016.
However, in conventional techniques using a slide window including the technique of Non-Patent Literature 1, since the window length is determined by a person who analyzes time-series data, there is a problem that a significant feature cannot be extracted in some cases.
The present disclosure has been made in order to solve the above problem, and an object according to an aspect of the embodiments is to provide a time-series data analysis device capable of proposing a window length of a slide window used when time-series data is analyzed.
A time-series data analysis device according to the present disclosure includes: processing circuitry to receive time-series data; to set a range of a window length of a time series subsequence in the time-series data; to calculate a feature of the time-series data for each of a plurality of window lengths within the range; to calculate a probability density distribution of the calculated feature for each of the plurality of window lengths; and to calculate a statistical feature of the probability density distribution calculated for each of the plurality of window lengths and select a window length to be used from among the plurality of window lengths on the basis of the calculated statistical feature.
According to an aspect of the time-series data analysis device of the present disclosure, it is possible to propose a window length used when a feature is extracted from time-series data.
Hereinafter, various embodiments according to the present disclosure will be described in detail with reference to the drawings.
The time-series data input unit 110 receives time-series data. Examples of the time-series data include, but are not limited to, industrial data such as voltage, current, frequency, or motor rotational speed acquired from a monitored device, medical data such as pulse rate, respiratory rate, or blood pressure, economic data such as stock price, future transaction price, or gross domestic product, and social activity data such as the number of passengers of public transportation such as trains, buses, or airplanes. As an example, the time-series data input unit 110 receives the time-series data as illustrated in
The parameter setting unit 120 receives a parameter value input by a user and used for analysis. The parameter may include an upper limit value and a lower limit value that define the range of the window length of the time series subsequence in the time-series data, and a value that specifies the type of normalization method when calculating the Euclidean distance between the time series subsequences from the time-series data.
The window length range may be set by the time-series data analysis device 100 instead of the user input. For example, as in a time-series data analysis device 100A illustrated in
Examples of the normalization method include: no normalization; mean zeroing, in which the average value of each time series subsequence is subtracted from each value of that subsequence; z-normalization, in which the average value of each time series subsequence is subtracted from each value of that subsequence and the result is divided by the standard deviation of the subsequence; and a method using a correlation coefficient.
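The first three normalization options can be illustrated with a minimal sketch. The method labels ("none", "mean_zero", "z"), the function names, and the handling of a constant subsequence are assumptions made for illustration; the correlation-coefficient method is omitted.

```python
import math

def normalize(sub, method="z"):
    """Normalize one time series subsequence (method labels are illustrative)."""
    if method == "none":
        return list(sub)
    mean = sum(sub) / len(sub)
    centered = [x - mean for x in sub]
    if method == "mean_zero":
        return centered
    if method == "z":
        std = math.sqrt(sum(c * c for c in centered) / len(sub))
        # Assumed convention: a constant subsequence (std == 0) maps to all zeros.
        if std == 0.0:
            return [0.0] * len(sub)
        return [c / std for c in centered]
    raise ValueError(f"unknown normalization method: {method}")

def distance(a, b, method="z"):
    """Euclidean distance between two equal-length subsequences after normalization."""
    na, nb = normalize(a, method), normalize(b, method)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))
```

With z-normalization, two subsequences that differ only in offset and scale, such as [1, 2, 3] and [10, 20, 30], have a distance of approximately zero, whereas without normalization they are far apart.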
The parameter setting unit 120 or the parameter setting unit 120A supplies the value of the received or determined parameter to the feature calculating unit 130.
The feature calculating unit 130 calculates the feature of the time-series data on the basis of the time-series data supplied from the time-series data input unit 110 and the parameter value supplied from the parameter setting unit 120. In the first embodiment, a matrix profile is calculated as the feature. Specifically, the matrix profile is calculated as follows.
First, the feature calculating unit 130 sets a time series subsequence defined by a window having a window length m from the head of the time-series data as a reference. Note that the window length m is a length within a range defined by the above-described lower limit value and upper limit value. Next, the feature calculating unit 130 slides the window in the time-series data from the head of the time-series data one time step at a time to calculate a pairwise Euclidean distance between the reference and a time series subsequence at each time step, and to generate a first distance vector having the pairwise Euclidean distances as elements. The Euclidean distances are calculated according to a specified normalization method.
Next, the feature calculating unit 130 shifts the window to the second position from the head of the time-series data by one time step without changing the window length m, and sets the time series subsequence defined by the shifted window as a new reference. The feature calculating unit 130 slides the window in the time-series data from the head of the time-series data one time step at a time to calculate a pairwise Euclidean distance between the new reference and a time series subsequence at each time step, and to generate a second distance vector having the pairwise Euclidean distances as elements.
Next, the feature calculating unit 130 further shifts the window to the third position from the head of the time-series data by one time step without changing the window length m, and repeats the same processing. In this way, while the time series subsequence used as the reference is changed, a pairwise distance profile is generated between the reference having the window length m and each time series subsequence defined by the window length having the same size, and a plurality of distance vectors having pairwise Euclidean distances as elements are generated.
The feature calculating unit 130 generates a distance matrix by vertically arranging the plurality of generated distance vectors. The feature calculating unit 130 then extracts the minimum distance for each row from the generated distance matrix, excluding the diagonal components and their peripheral components. From the first row of the distance matrix, the minimum distance in a case where the reference is at the head position of the time-series data is extracted. From the second row of the distance matrix, the minimum distance in a case where the reference is at the second position from the head of the time-series data is extracted. From the third row of the distance matrix, the minimum distance in a case where the reference is at the third position from the head of the time-series data is extracted. In general, from the n-th row of the distance matrix, the minimum distance in a case where the reference is at the n-th position from the head of the time-series data is extracted. Note that n is a positive integer. As a result, the minimum distance at each position of the reference, that is, at each time, is extracted. In this way, a profile of the distance matrix for the window length m, that is, a matrix profile, is generated. Note that each diagonal component of the distance matrix is the distance between a reference and itself, and is a trivial match that is always zero. In addition, the components peripheral to a diagonal component may also be zero. Since such diagonal components and their peripheral components carry no significant information, the minimum distance is extracted from the distance matrix with these components excluded. Note that the “peripheral” range is, for example, a range before and after the diagonal component whose width is between the window length and the window length divided by k (where k is about 4 or less), but is not limited to this specific example.
Note that a distance matrix may be generated by transposing the plurality of generated distance vectors as vertical vectors and arranging the vectors horizontally, and the minimum distance may be extracted from each row in the generated distance matrix.
A time series subsequence having a particularly small value in the generated matrix profile means that there are other time series subsequences similar to this time series subsequence. That is, it is suggested that some pattern is included in the time-series data. Conversely, a time series subsequence having a particularly large value in the matrix profile means an outlier, and the presence of such a time series subsequence suggests that an anomaly is included in the time-series data.
The feature calculating unit 130 changes the value of the length m of the time series subsequence within the range of the upper limit value and the lower limit value of the length of the time series subsequence set via the parameter setting unit 120, and generates a matrix profile for each of various types of m. The matrix profile may be generated for all values within that range, or may be generated for discrete values. Such an aggregate of two or more matrix profiles, including matrix profiles generated for all values of m, as well as matrix profiles generated for discrete values of m, may be referred to herein as a pan-matrix profile (PMP). Furthermore, in a case where there are a plurality of specified normalization methods, the feature calculating unit 130 generates PMPs for all the methods.
The feature calculating unit 130 supplies the generated pan-matrix profile PMP to the probability density distribution calculating unit 140 as a feature of time-series data. The processing result of the feature calculating unit 130 may be temporarily stored in a storage (not illustrated), and a functional unit at a subsequent stage may perform predetermined processing with reference to the storage as necessary. The storage may be provided in the time-series data analysis device 100 or may be provided outside the time-series data analysis device 100. The same applies to the processing results of the probability density distribution calculating unit 140, the heat map creating unit 150, and the parameter selecting unit 160.
As described above, the feature calculating unit 130 calculates a matrix profile, which is a feature of time-series data, for each of the plurality of window lengths within the range set by the parameter setting unit 120.
The probability density distribution calculating unit 140 calculates a probability density distribution PDD from the feature supplied from the feature calculating unit 130. Specifically, the probability density distribution calculating unit 140 calculates the probability density distribution PDD of the pan-matrix profile PMP using the value of the normalized matrix profile as a random variable for each window length of the pan-matrix profile PMP supplied from the feature calculating unit 130.
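One way to picture this step is a histogram-based density estimate computed separately for each window length. The sketch below makes several assumptions that are not stated in the text: the matrix profile is normalized by dividing by its maximum value, the density is estimated with equal-width bins over [0, 1], and the pan-matrix profile is held as a dictionary keyed by window length.

```python
def density_histogram(values, bins=10, lo=0.0, hi=1.0):
    """Histogram estimate of a probability density over [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        k = min(int((v - lo) / width), bins - 1)  # a value equal to hi goes in the last bin
        counts[k] += 1
    total = len(values)
    # Dividing by (total * width) makes the histogram integrate to 1.
    return [c / (total * width) for c in counts]

def pdd_per_window(pan_profile, bins=10):
    """pan_profile: {window_length: [matrix profile values]} -> {window_length: density}."""
    result = {}
    for m, profile in pan_profile.items():
        peak = max(profile)
        # Assumed normalization: scale each profile into [0, 1] by its maximum.
        scaled = [v / peak if peak > 0 else 0.0 for v in profile]
        result[m] = density_histogram(scaled, bins)
    return result
```

Each returned list holds one density value per bin, so the lists for different window lengths can be compared bin by bin.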
The small value of the matrix profile means that the degree of similarity between the reference time series subsequence and the comparison target time series subsequence is high. Conversely, the large value of the matrix profile means that the degree of dissimilarity between the reference time series subsequence and the comparison target time series subsequence is high.
Therefore, in the case of time-series data including random signals and rare regular signals, when a window length is appropriately set from the time-series data and a matrix profile is created to create a probability density distribution, a peak of the distribution in the probability density distribution is on the right side. That is, the skewness becomes negative, and the peak of the distribution in the probability density distribution appears in a region where the value of the matrix profile is large. The appropriate window length mentioned here is a time width of the regular signal.
In addition, in the case of time-series data including a regular signal and an irregular or sudden anomaly that occurs rarely, when a window length is appropriately set from the time-series data and a matrix profile is created to create a probability density distribution, a peak of the distribution in the probability density distribution is on the left side. That is, the skewness is positive, and the peak of the distribution in the probability density distribution appears in the region where the value of the matrix profile is small. Examples of time-series data including irregularity or sudden anomaly include, for example, electrocardiogram data acquired from a person with arrhythmia.
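The relationship between the position of the peak and the sign of the skewness can be checked with a small numerical sketch. The function below computes the population skewness (third central moment divided by the 1.5 power of the variance); the sample values are made up for illustration.

```python
def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

# Mostly regular data with a rare anomaly: most matrix profile values are small.
peak_on_left = [0.1] * 9 + [0.9]
# Mostly random data with a rare regular signal: most matrix profile values are large.
peak_on_right = [0.9] * 9 + [0.1]
```

Here skewness(peak_on_left) is positive and skewness(peak_on_right) is negative, which matches the two cases described above.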
The probability density distribution calculating unit 140 supplies the probability density distribution PDD calculated for each m to the heat map creating unit 150 and the parameter selecting unit 160.
The heat map creating unit 150 creates a heat map from the probability density distribution PDD for each m supplied from the probability density distribution calculating unit 140. Here, an example of the heat map created by the heat map creating unit 150 will be described with reference to
In the heat map of
In the heat map of
The peak in the circle A or the circle B and the window length related to the peak may be determined by the user with reference to the heat map or may be extracted by processing of the parameter selecting unit 160 described later. The heat map creating unit 150 outputs the created heat map to the output unit 170.
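As a rough illustration, the data behind such a heat map can be held as a grid with one row per window length and one column per bin of the probability density distribution. The row and column orientation, the dictionary input, and the function names are assumptions made for this sketch; rendering the grid as an image is left out.

```python
def heat_map_grid(pdd):
    """pdd: {window_length: [density per bin]} -> (sorted window lengths, grid rows)."""
    lengths = sorted(pdd)
    grid = [list(pdd[m]) for m in lengths]
    return lengths, grid

def peak_cell(lengths, grid):
    """Return (window length, bin index, density) of the highest-density cell."""
    best = None
    for m, row in zip(lengths, grid):
        for k, density in enumerate(row):
            if best is None or density > best[2]:
                best = (m, k, density)
    return best
```

Locating the highest-density cell in this way corresponds to reading a peak and its window length off the heat map.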
The parameter selecting unit 160 calculates a statistical feature of the probability density distribution PDD from the probability density distribution PDD for each m supplied from the probability density distribution calculating unit 140. Examples of the statistical feature include a maximum value, a standard deviation, a skewness, and a kurtosis. In addition, the parameter selecting unit 160 calculates or selects a set of an appropriate window length and a value of a type of a normalization method using the calculated statistical feature. Processing performed by the parameter selecting unit 160 will be described with reference to
In the fifth line, the window length w and the normalization type n, which are parameters output as the selection result, are initialized. Since the window length w does not become 0 or less, the window length w is initialized to -1, for example. The type of normalization n is initialized by any value nan that is not included in a set N of types of normalization. In addition, the variable p used in the algorithm is also initialized.
In the for-loop block in the sixth to twentieth lines, the parameter selecting unit 160 executes the processing in the seventh to twentieth lines for each type of normalization. In the seventh line and the eighth line, the parameter selecting unit 160 selects a window length wprob at which the maximum value MAX of the probability density first becomes maximum, and stores the value of MAX at that time as pprob. In the ninth line and the tenth line, the parameter selecting unit 160 selects a window length wstd at which the standard deviation STD first becomes minimum, and stores the value of MAX at that time as pstd. In the eleventh line and the twelfth line, the parameter selecting unit 160 selects a window length wskew at which the skewness SKEW first becomes maximum, and stores the value of MAX at that time as pskew. In the thirteenth line and the fourteenth line, when the probability density pprob stored for MAX is larger than both the probability density pstd stored for STD and the probability density pskew stored for SKEW, the parameter selecting unit 160 stores the probability density pprob as pcand, and sets the window length wprob as a result candidate wcand. In the fifteenth to eighteenth lines, the parameter selecting unit 160 performs the same determination as that in the thirteenth and fourteenth lines for STD and SKEW. In the nineteenth and twentieth lines, when the probability density pcand selected in the thirteenth to eighteenth lines is larger than the probability density p held as the intermediate selection result, the probability density p is updated with pcand, the window length w is updated with wcand, and the type of normalization n is updated with ni.
In the twenty-first line, the parameter selecting unit 160 outputs a set of the window length w and the normalization type n.
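The candidate selection in the seventh to twelfth lines can be sketched for a single type of normalization as follows. The input layout (a dictionary of statistics per window length with the keys "max", "std", and "skew"), the function name, and the find_motif flag are hypothetical; the comparison in the thirteenth to twentieth lines is not reproduced here.

```python
def candidate_window_lengths(per_window, find_motif=False):
    """per_window: {window_length: {"max": ..., "std": ..., "skew": ...}}.

    Returns, for each statistic, the selected window length and the peak
    probability density (MAX) stored at that window length.
    """
    lengths = sorted(per_window)
    # max() and min() return the first extreme element, so the shortest window wins a tie.
    w_prob = max(lengths, key=lambda w: per_window[w]["max"])
    w_std = min(lengths, key=lambda w: per_window[w]["std"])
    pick = min if find_motif else max  # argmax of the skewness by default, argmin if requested
    w_skew = pick(lengths, key=lambda w: per_window[w]["skew"])
    chosen = (("prob", w_prob), ("std", w_std), ("skew", w_skew))
    return {name: (w, per_window[w]["max"]) for name, w in chosen}
```

Running this once per type of normalization yields the window lengths and stored probability densities that the later comparison steps operate on.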
According to the above algorithm, in the case of the heat map of
Although the algorithm that outputs the set of the window length w and the normalization type n when finding a discord has been described above, when finding a motif, the argmax in the eleventh line may be changed to argmin, which obtains the minimum value. As a result, in the case of the heat map of
In the algorithm of
The parameter selecting unit 160 supplies a set of the selected window length w and the normalization type n to the output unit 170.
The output unit 170 outputs the heat map supplied from the heat map creating unit 150 and the selected parameter supplied from the parameter selecting unit 160 to an external device such as a display device.
Next, a hardware configuration example of the time-series data analysis device 100 will be described with reference to
As another example, as illustrated in
Next, the operation of the time-series data analysis device 100 will be described with reference to
The parameter setting unit 120A may calculate the upper limit value or the lower limit value of the length of the time series subsequence from the time-series data. Specifically, the parameter setting unit 120A may determine a range such as from 10 (lower limit value) to 1/n of the length of the time-series data (upper limit value). The character n is any positive integer. Alternatively, the parameter setting unit 120A may determine a range such as from 1/n of the length of the time-series data (lower limit value) to 1000 (upper limit value). Alternatively, the parameter setting unit 120A may determine a range such as from 1/10n (lower limit value) to 1/n (upper limit value) of the length of the time-series data. In a case where the parameter setting unit 120A calculates the upper limit value or the lower limit value of the length of the time series subsequence from the time-series data, step ST102 is performed after step ST101.
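The three example ranges can be sketched as a small helper. The function name, the mode labels, the use of integer division, and the reading of "1/10n" as 1/(10 × n) are assumptions made for this sketch.

```python
def window_range(series_length, n=4, mode="fixed_lower",
                 fixed_lower=10, fixed_upper=1000):
    """Return (lower limit, upper limit) of the window length for a series of the given length."""
    if mode == "fixed_lower":        # from 10 to 1/n of the series length
        lo, hi = fixed_lower, series_length // n
    elif mode == "fixed_upper":      # from 1/n of the series length to 1000
        lo, hi = series_length // n, fixed_upper
    elif mode == "both_relative":    # from 1/(10n) to 1/n of the series length (assumed reading)
        lo, hi = series_length // (10 * n), series_length // n
    else:
        raise ValueError(f"unknown mode: {mode}")
    if lo < 1 or lo > hi:
        raise ValueError("series is too short for the requested range")
    return lo, hi
```

For a series of 4000 samples and n = 4, the first mode gives a range of 10 to 1000 and the third mode gives 100 to 1000.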
In step ST103, the feature calculating unit 130 calculates the feature of the time-series data on the basis of the time-series data output by the time-series data input unit 110 and the parameter value output by the parameter setting unit 120 or the parameter setting unit 120A. For example, a matrix profile is calculated as the feature. The feature calculating unit 130 outputs the calculated feature.
In step ST104, the probability density distribution calculating unit 140 calculates the probability density distribution PDD of the feature of the time-series data output from the feature calculating unit 130. For example, the probability density distribution calculating unit 140 calculates the probability density distribution PDD for each of the plurality of matrix profiles supplied from the feature calculating unit 130. The probability density distribution calculating unit 140 outputs the calculated probability density distribution PDD.
In step ST105, the parameter selecting unit 160 calculates a statistical feature from the probability density distribution PDD calculated by the probability density distribution calculating unit 140, and selects a window length using the calculated statistical feature.
The above-described program may be stored in a storage medium. Examples of the storage medium include a nonvolatile or volatile semiconductor memory such as a random access memory (RAM), a read only memory (ROM), a flash memory, an erasable programmable read only memory (EPROM), or an electrically erasable programmable read only memory (EEPROM), a magnetic disk, a flexible disk, an optical disk, a compact disk, a mini disk, or a DVD.
Although the configuration and operation of the time-series data analysis device 100 have been described above while referring to a case where the time-series data includes an abnormal or similar time series subsequence, the time-series data analysis device 100 is useful for analysis of any time-series data. For example, even in a case where no anomaly data is included, the time-series data analysis device 100 according to the present disclosure is useful. This will be described with reference to
Since the time-series data of
Even in the case where the right skirt of the probability density distribution becomes short as described above, the influence of removal of the anomaly data is equally reflected in the distribution of the normal data. Therefore, when probability density distributions of different window lengths are compared, relative features between the probability density distributions are not different. For example, the window length at which the probability density becomes maximum in the probability density distribution of the matrix profile created from the time-series data including the anomaly data is the same as the window length at which the probability density becomes maximum in the probability density distribution of the matrix profile created from the time-series data obtained by removing the anomaly data from the time-series data. In addition, the window length at which the standard deviation becomes minimum in the probability density distribution of the matrix profile created from the time-series data including the anomaly data is the same as the window length at which the standard deviation becomes minimum in the probability density distribution of the matrix profile created from the time-series data obtained by removing the anomaly data from the time-series data. In addition, the window length at which the skewness becomes maximum in the probability density distribution of the matrix profile created from the time-series data including the anomaly data is the same as the window length at which the skewness becomes maximum in the probability density distribution of the matrix profile created from the time-series data obtained by removing the anomaly data from the time-series data. 
In addition, the window length at which kurtosis becomes maximum in the probability density distribution of the matrix profile created from the time-series data including the anomaly data is the same as the window length at which kurtosis becomes maximum in the probability density distribution of the matrix profile created from the time-series data obtained by removing the anomaly data from the time-series data.
Therefore, as indicated by a circle C in
Therefore, even in a case where anomaly data has not yet been obtained from a certain device to be analyzed, the time-series data analysis device 100 according to the present disclosure can find a window length appropriate for analysis of the time-series data of the device to be analyzed. By monitoring the value of the matrix profile of the device to be analyzed using the found window length, it is possible to determine that an anomaly has occurred when a large matrix profile value is found.
In the first embodiment, the case where the feature calculated by the feature calculating unit 130 is the matrix profile has been described. In a second embodiment, a case where such feature is double amplitude (Peak-to-Peak) will be described with reference to
As illustrated in
The feature calculating unit 130A calculates the feature of the time-series data on the basis of the time-series data supplied from the time-series data input unit 110 and the parameter value supplied from the parameter setting unit 120. A double amplitude (Peak-to-Peak) is calculated as the feature. The feature calculating unit 130A changes the value of the length m of the time series subsequence within the range of the upper limit value and the lower limit value of the length of the time series subsequence set via the parameter setting unit 120, and calculates a double amplitude for each of various types of m. The feature calculating unit 130A calculates a double amplitude for each window length as illustrated in
As in the case of the first embodiment, the probability density distribution calculating unit 140 calculates a probability density distribution PDD for each m from the features supplied from the feature calculating unit 130A. The calculated probability density distribution PDD is supplied to the heat map creating unit 150 and the parameter selecting unit 160A.
The heat map creating unit 150 creates a heat map from the probability density distribution PDD for each m supplied from the probability density distribution calculating unit 140.
The parameter selecting unit 160A calculates a statistical feature of the probability density distribution PDD according to the algorithm illustrated in
The output unit 170 outputs the heat map supplied from the heat map creating unit 150 and the selected parameter supplied from the parameter selecting unit 160A to an external device such as a display device.
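The double amplitude feature of the second embodiment can be sketched as the difference between the maximum and minimum values inside each window position, computed for every window length in the set range. The function names and the step parameter are hypothetical.

```python
def peak_to_peak(series, m):
    """Double amplitude (max minus min) of every subsequence of length m."""
    return [max(series[i:i + m]) - min(series[i:i + m])
            for i in range(len(series) - m + 1)]

def peak_to_peak_per_window(series, lower, upper, step=1):
    """Double amplitude for each window length m in [lower, upper]."""
    return {m: peak_to_peak(series, m) for m in range(lower, upper + 1, step)}
```

The resulting dictionary plays the same role as the set of matrix profiles in the first embodiment: a probability density distribution is then calculated from the values for each window length.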
Some of various aspects of the embodiments described above will be summarized below.
A time-series data analysis device (100, 100A; 200) of Supplement 1 includes: a time-series data input unit (110) to receive time-series data; a parameter setting unit (120, 120A) to set a range of a window length of a time series subsequence in the time-series data; a feature calculating unit (130; 130A) to calculate a feature of the time-series data for each of a plurality of window lengths within the range; a probability density distribution calculating unit (140) to calculate a probability density distribution of the calculated feature for each of the plurality of window lengths; and a parameter selecting unit (160; 160A) to calculate a statistical feature of the probability density distribution calculated for each of the plurality of window lengths and select a window length to be used from among the plurality of window lengths on the basis of the calculated statistical feature.
A time-series data analysis device (100, 100A; 200) of Supplement 2 is the time-series data analysis device of Supplement 1, in which the feature is a value of a matrix profile or a double amplitude value.
A time-series data analysis device (100, 100A; 200) of Supplement 3 is the time-series data analysis device of Supplement 1 or 2, in which the parameter setting unit (120A) sets the range by calculating at least one of an upper limit value and a lower limit value on the basis of the received time-series data.
A time-series data analysis device (100, 100A; 200) of Supplement 4 is the time-series data analysis device of any one of Supplements 1 to 3, in which the statistical feature is a maximum value, and the parameter selecting unit selects a window length having a maximum probability density as the window length to be used.
A time-series data analysis device (100, 100A; 200) of Supplement 5 is the time-series data analysis device of any one of Supplements 1 to 4, in which the statistical feature is a standard deviation, and the parameter selecting unit selects a window length having a minimum standard deviation as the window length to be used.
A time-series data analysis device (100, 100A; 200) of Supplement 6 is the time-series data analysis device of any one of Supplements 1 to 5, in which the statistical feature is a skewness, and the parameter selecting unit selects a window length having a positive or negative skewness as the window length to be used.
A time-series data analysis device (100, 100A; 200) of Supplement 7 is the time-series data analysis device of any one of Supplements 1 to 6, in which the statistical feature is a kurtosis, and the parameter selecting unit selects a window length having a maximum kurtosis as the window length to be used.
A time-series data analysis device (100, 100A; 200) of Supplement 8 is the time-series data analysis device of any one of Supplements 1 to 7, and further includes a heat map creating unit (150) to create a heat map of the calculated probability density distribution from the calculated probability density distribution.
A time-series data analysis method of Supplement 9 includes the steps of: receiving, by a time-series data input unit (110), time-series data (ST101); setting, by a parameter setting unit (120), a range of a window length of a time series subsequence in the time-series data (ST102); calculating, by a feature calculating unit (130), a feature of the time-series data for each of a plurality of window lengths within the range (ST103); calculating, by a probability density distribution calculating unit (140), a probability density distribution of the calculated feature for each of the plurality of window lengths (ST104); and calculating, by a parameter selecting unit (160), a statistical feature of the probability density distribution calculated for each of the plurality of window lengths and selecting a window length to be used from among the plurality of window lengths on a basis of the calculated statistical feature (ST105).
A time-series data analysis program of Supplement 10 causes a computer to execute: a time-series data input function of receiving time-series data; a parameter setting function of setting a range of a window length of a time series subsequence in the time-series data; a feature calculating function of calculating a feature of the time-series data for each of a plurality of window lengths within the range; a probability density distribution calculating function of calculating a probability density distribution of the calculated feature for each of the plurality of window lengths; and a parameter selecting function of calculating a statistical feature of the probability density distribution calculated for each of the plurality of window lengths and selecting a window length to be used from among the plurality of window lengths on a basis of the calculated statistical feature.
Note that the embodiments can be combined, and the embodiments can be appropriately modified or omitted.
Since the time-series data analysis device according to the present disclosure includes the parameter selecting unit, it is possible to propose a window length used when analyzing the time-series data. Thus, the time-series data analysis device can be used to analyze time-series data for which a suitable window length is unknown.
100: time-series data analysis device, 100A: time-series data analysis device, 110: time-series data input unit, 120: parameter setting unit, 120A: parameter setting unit, 130: feature calculating unit, 130A: feature calculating unit, 140: probability density distribution calculating unit, 150: heat map creating unit, 160: parameter selecting unit, 160A: parameter selecting unit, 170: output unit, 200: time-series data analysis device, 401: processor, 402: memory, 403: I/F device, 404: processing circuit
This application is a Continuation of PCT International Application No. PCT/JP2021/006025 filed on Feb. 18, 2021, which is hereby expressly incorporated by reference into the present application.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/JP2021/006025 | Feb 2021 | WO |
| Child | 18216107 | | US |