This disclosure is generally related to characterizing time series data. More specifically, this disclosure is related to a system and method for characterizing a time series of arbitrary length using pre-selected signatures.
Time series data is a temporal sequence of real or categorical variables. That is, time series data is a series of data points indexed (or otherwise represented) in a temporal order and generally at successively equally spaced points in time. An example of time series data is power generated over time from a wind turbine. Analyzing time series data may be challenging due to the large dimensionality of the data and a lack of knowledge or variation in the time scales of interest. For example, one minute of time series data in the wind turbine example can include sample data recorded by a sensor which generates 1,000 samples per second, for a total of 60,000 samples. The dimensionality (i.e., massive volume) of this time series data can create challenges in analyzing the data. A lack of knowledge regarding which portions of the voluminous 60,000 samples may contain useful data can also create challenges in analyzing the data.
One embodiment provides a system for facilitating characterization of a time series of data associated with a physical system. During operation, the system determines, by a computing device, one or more signatures, wherein a signature indicates a basis function for a known time series of data. The system trains a neural network based on the signatures as a known output. The system applies the trained neural network to the time series to generate a probability that the time series is characterized by a respective signature. The system enhances an analysis of the time series data and the physical system based on the probability.
In some embodiments, the system applies the trained neural network to a first portion of the time series to generate, for each signature, a first probability that the time series is characterized by a respective signature, wherein the first portion has a first length and includes a first number of most recent entries of the time series. The system determines a second portion of the time series, wherein the second portion has a second length and includes a second number of most recent entries of the time series. The system reduces the second number of entries. The system applies the trained neural network to the reduced second number of entries to generate, for each signature, a second probability that the time series is characterized by a respective signature. The system characterizes the time series based on the first probability and the second probability.
In some embodiments, determining the second portion, reducing the second number of entries, and applying the trained neural network to the reduced second number of entries are in response to determining that a length of the first portion scaled by an integer does not exceed a total length of the time series. The system also sets the second portion as the first portion, and perturbs the integer. Characterizing the time series is further based on one or more second probabilities.
In some embodiments, the second number is equal to the first number scaled by an integer, reducing the second number of entries is based on the integer, and the time series is of an arbitrary length.
In some embodiments, the time series is of a fixed length.
In some embodiments, the network is trained based on one or more of: data generated from the signatures; an input which is a time series corresponding to a signature; and an output which is a one-hot vector of a size equal to a number of determined signatures, wherein a vector entry with a value equal to one corresponds to an index associated with a signature.
In some embodiments, the generated probability indicates, for each signature, a relative proportion or weight with which the time series is characterized by the respective signature, and the relative proportion or weight is a comparison of the respective signature to all the determined signatures.
In some embodiments, the neural network is a recurrent neural network.
In the figures, like reference numerals refer to the same figure elements.
The following description is presented to enable any person skilled in the art to make and use the embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The embodiments described herein solve the problem of effectively analyzing a high volume of time series data. Time series data is a temporal sequence of real or categorical variables, e.g., a series of data points indexed (or otherwise represented) in a temporal order. Time series data is generally taken at successively equally spaced points in time. An example of time series data is power generated over time from a wind turbine. Analyzing time series data may be challenging due to the large dimensionality of the data and a lack of knowledge or variation in the time scales of interest. For example, one minute of time series data in the wind turbine example can include sample data recorded by a sensor which generates 1,000 samples per second, for a total of 60,000 samples. The dimensionality (i.e., massive volume) of this time series data can create challenges in analyzing the data. A lack of knowledge regarding which portions of the voluminous 60,000 samples may contain useful data can also create challenges in analyzing the data effectively.
The embodiments described herein address these challenges by providing a signature detector system which identifies the relative proportions of known signatures in a fixed-length time series, and further characterizes an arbitrary-length time series based on the signature detector. The system trains a recurrent neural network based on pre-selected known signatures and training data. Subsequently, the system uses time series data of a fixed length as input into the trained recurrent neural network, and generates as output, for each signature, a probability that the time series is characterized by the respective signature. Training a recurrent neural network is described below in relation to FIG. 5.
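By way of non-limiting illustration only, the following sketch shows one possible way to generate such training data from pre-selected signatures, where each training pair consists of an input time series and a one-hot output vector. Python with NumPy, the particular basis functions (sine, square, and sawtooth), and the values of L and M are assumptions made for this example and are not prescribed by the disclosure.

    import numpy as np

    L = 64          # selected fixed length L of each input time series
    M = 3           # number of pre-selected signatures

    def sine(t):
        return np.sin(2 * np.pi * t)

    def square(t):
        return np.sign(np.sin(2 * np.pi * t))

    def sawtooth(t):
        return 2.0 * (t % 1.0) - 1.0

    SIGNATURES = [sine, square, sawtooth]   # hypothetical basis functions, index 0..M-1

    def make_example(sig_index, noise=0.05, rng=None):
        """Return one (length-L series, one-hot label of size M) training pair."""
        if rng is None:
            rng = np.random.default_rng()
        t = np.linspace(0.0, rng.uniform(1.0, 4.0), L)   # random number of periods
        x = SIGNATURES[sig_index](t) + noise * rng.standard_normal(L)
        y = np.zeros(M)
        y[sig_index] = 1.0      # one-hot output: entry at the signature's index is one
        return x, y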
The system can then use the signature detector to characterize an arbitrary-length time series by sequentially using, as an input time series set, an increasing number of entries which are downsampled to a fixed length. In this case, the output of the signature detector is a number of probability vectors, one for each sequential input time series set, where each vector includes, for each signature, a probability that the arbitrary-length time series is characterized by the respective signature. Characterizing an arbitrary-length time series using the signature detector and a downsampling module is described below in relation to FIG. 7.
Thus, the embodiments described herein provide a computer system which improves the efficiency of analyzing a massive set of time series data associated with a physical system, where the improvements are fundamentally technological. The improved efficiency can include a signature detector which identifies relative proportions of known signatures in a fixed-length time series, where the signature detector may be used to characterize an arbitrary-length time series. The computer system provides a technological solution (i.e., a method for characterizing an arbitrary-length time series of data associated with a physical system based on pre-selected signatures) to the technological problem of analyzing voluminous time series data associated with a physical system. This technological solution enhances an analysis of a massive volume of time series data, and can increase both the efficiency and effectiveness of the physical system.
In the embodiments described herein, a user of a computer system can use fixed-length time series data associated with physical objects to determine relative weights of known signatures (e.g., via the signature detector, whose output may be visualized as a probability vector), and subsequently alter a physical feature associated with the physical object being measured. Furthermore, the user can use the characterization of an arbitrary-length time series to determine the relative weights of the known signatures, and again alter a physical feature associated with the physical object being measured. The user can also make any alteration to improve the efficiency and effectiveness of the physical system, based on the output of the signature detector and the characterization of an arbitrary-length time series of data associated with the physical system.
Thus, upon receiving time series data 130, training data 136, and selected length and signatures 131, device 106 can train a recurrent neural network (train network function 132), where each input time series corresponds to one signature. The output is an M-size vector 133, which is a one-hot vector of size M, where the vector entry equal to one corresponds to the index associated with the signature. Device 106 can generate and send M-size vector 133 back to device 102 for further analysis.
Subsequently, the recurrent neural network may be used to generate a probability vector (also of size M) given a fixed-size input, where the relative probabilities indicate the relative weight of a corresponding signature for the input data.
Upon receiving both time series data 140 and selected length 141, device 106 can use the previously trained recurrent neural network (described above in relation to train network function 132) to generate a probability vector of size M for time series data 140.
The three signatures 431, 432, and 433 may be represented in a one-hot vector of size M=3. In a respective signature vector, the vector entry with a value equal to one corresponds to the index of the respective signature. For example: signature 431 may have a respective signature vector 441 with index values [i1, i2, i3]=[1, 0, 0]; signature 432 may have a respective signature vector 442 with index values [i1, i2, i3]=[0, 1, 0]; and signature 433 may have a respective signature vector 443 with index values [i1, i2, i3]=[0, 0, 1]. Signatures 431, 432, and 433 are discussed below in the exemplary probability vector of FIG. 6.
Network 500 may be trained using synthetic or training data. The system can select a length L of a given time series (e.g., time series 300), and also select M known signatures (e.g., M=3, the three signatures of FIG. 4). The system can then generate training data from the selected signatures, where each input time series corresponds to one signature and the known output is the corresponding one-hot vector of size M.
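For illustration, one hypothetical realization of network 500 and its training loop is sketched below, assuming PyTorch and reusing the make_example() helper from the earlier sketch. The architecture (a single GRU layer), hidden size, optimizer, and batch construction are illustrative assumptions only; note that a one-hot target is represented here by its class index, which is equivalent for cross-entropy training.

    import numpy as np
    import torch
    import torch.nn as nn

    class SignatureDetector(nn.Module):
        def __init__(self, hidden=32, M=3):
            super().__init__()
            self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, M)        # logits over the M signatures

        def forward(self, x):                       # x: (batch, L) raw series
            _, h = self.rnn(x.unsqueeze(-1))        # h: (1, batch, hidden)
            return self.head(h.squeeze(0))          # (batch, M) logits

    net = SignatureDetector()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()                 # one-hot target given as its class index

    for step in range(1000):
        idx = np.random.randint(0, 3, size=32)      # one signature index per example
        xs = torch.stack([torch.tensor(make_example(i)[0], dtype=torch.float32)
                          for i in idx])            # (32, L) batch of input series
        ys = torch.tensor(idx, dtype=torch.long)    # targets as class indices
        loss = loss_fn(net(xs), ys)
        opt.zero_grad()
        loss.backward()
        opt.step()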
Upon training network 500, a time series of a fixed length L may be passed in as input to generate a probability vector of size M.
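A minimal sketch of this inference step, under the same assumptions as the training sketch above (probability_vector is a hypothetical helper name, not an element of the disclosure):

    import torch

    def probability_vector(net, series):
        """Apply the trained detector to one fixed-length series; return a size-M vector."""
        with torch.no_grad():
            logits = net(series.unsqueeze(0))                # add a batch dimension
            return torch.softmax(logits, dim=-1).squeeze(0)  # probabilities over M signatures

    # e.g., p = probability_vector(net, torch.tensor(x, dtype=torch.float32))
    # p[i] is the relative weight of signature i for the input data.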
Thus, by identifying the relative proportion of known signatures, the system provides a signature detector whose output is a probability vector with components representing the known signatures. The system characterizes a fixed-length time series based on the known signatures. This characterization can guide a user or other client to effect improvements to increase the efficiency of another system. For example, in the wind turbine example, the system can use a fixed-length reading from sensor 110.3 to determine relative proportions of certain known signatures (as a probability vector), and, based on the probability vector, the user can subsequently modify a speed, size, direction, or other feature of blade 110 or another feature relating to the readings taken by sensor 110.3. The embodiments described herein allow a user to modify a physical feature associated with the physical object being measured, which provides a concrete technological solution to a technological problem by enhancing the analysis of a voluminous amount of time series data.
At time 712, the system can pass into signature detector 720 the input 732 of a length L=4: [(Xt-3), (Xt-2), (Xt-1), (Xt)], and generate an output 734 of [Psig_1, Psig_2, Psig_3]. Output 734 is similar to probability vector 620, in that it has size M=3 and characterizes the time series based on the relative probabilities for the M signatures.
Subsequently, the system can select the latest H*L entries of the time series, where H is any positive integer, and reduce (i.e., downsample) the selected entries by a factor of H to obtain L entries. The system can again generate an M-size probability vector. The system can continue to increase H and repeat the selecting, reducing, and generating steps until H*L is greater than the arbitrary length Q.
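Before turning to the specific example below, the following minimal sketch captures this loop, reusing the hypothetical probability_vector() helper from above. Stride-H decimation is assumed as the downsampling algorithm purely for concreteness; as noted below, any downsampling algorithm that reduces H*L entries to L entries may be used.

    import torch

    def characterize(net, series, L):
        """Collect one M-size probability vector per value of H until H*L exceeds Q."""
        Q = len(series)
        vectors, H = [], 1                  # H=1 reproduces the initial, un-downsampled pass
        while H * L <= Q:                   # stop once H*L is greater than Q
            window = series[-H * L:]        # latest H*L entries of the time series
            reduced = window[::H]           # downsample by a factor of H (stride H)
            x = torch.as_tensor(reduced, dtype=torch.float32)
            vectors.append(probability_vector(net, x))
            H += 1                          # perturb (increment) the integer H
        return vectors                      # N vectors, N*M probabilities in total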
Specifically, at time 714, the system can set H=2, and select the latest 2*4=8 entries of the time series. The system can pass into downsampler module 718 the input 742 with H*L=8 entries: [(Xt-7), (Xt-6), (Xt-5), (Xt-4), (Xt-3), (Xt-2), (Xt-1), (Xt)]. Downsampler module 718 can use any downsampling algorithm to reduce the number of entries by a factor of H to obtain L entries. The system can subsequently send to signature detector 720 the downsampled input 743 of a length L=4: [(Xt-7), (Xt-5), (Xt-3), (Xt-1)], and generate an output 744 of [Psig_1, Psig_2, Psig_3].
At time 716, the system can increase H by 1 and set H=2+1=3, and select the latest 3*4=12 entries of the time series. The system can pass into downsampler module 718 the input 752 with H*L=12 entries: [(Xt-11), (Xt-10), (Xt-9), (Xt-8), (Xt-7), (Xt-6), (Xt-5), (Xt-4), (Xt-3), (Xt-2), (Xt-1), (Xt)]. The system can subsequently send to signature detector 720 the downsampled input 753 of a length L=4: [(Xt-11), (Xt-8), (Xt-5), (Xt-2)], and generate an output 754 of [Psig_1, Psig_2, Psig_3].
At time 718, the system can increase H by 1 and set H=3+1=4, and select the latest 4*4=16 entries of the time series. The system can pass into downsampler module 718 the input 762 with H*L=16 entries: [(Xt-15), (Xt-14), (Xt-13), (Xt-12), (Xt-11), (Xt-10), (Xt-9), (Xt-8), (Xt-7), (Xt-6), (Xt-5), (Xt-4), (Xt-3), (Xt-2), (Xt-1), (Xt)]. The system can subsequently send to signature detector 720 the downsampled input 763 of a length L=4: [(Xt-15), (Xt-11), (Xt-7), (Xt-3)], and generate an output 764 of [Psig_1, Psig_2, Psig_3].
The system can determine that increasing H again leads to H*L being greater than Q (i.e., 5*4>16). The collection of the M-size probability vectors (i.e., outputs 734, 744, 754, and 764) forms probability vector 780, which characterizes the time series of arbitrary length Q=16 given the selected M=3 signatures.
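Under the same assumptions, the loop sketched above reproduces this example; x_arbitrary is a hypothetical array standing in for the Q=16 samples, and in practice L here would match the fixed length used during training.

    import numpy as np
    import torch

    x_arbitrary = np.random.standard_normal(16)  # 16 hypothetical samples (Xt-15 ... Xt)
    outs = characterize(net, x_arbitrary, L=4)   # H = 1, 2, 3, 4; 5*4 > 16 stops the loop
    assert len(outs) == 4                        # N = 4 outputs, analogous to 734-764
    flat = torch.cat(outs)                       # size N*M = 4*3 = 12, cf. vector 780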
Thus, by using the signature detector (e.g., signature detector 720) together with a downsampling module (e.g., downsampler module 718), the system can characterize a time series of an arbitrary length based on the relative proportions of the pre-selected signatures.
The characterization of the time series of a fixed length may thus be represented as a probability vector of size M, where M is the number of pre-selected signatures (e.g., probability vector 620 of size M=3 of FIG. 6).
If the length of the first portion scaled by the integer is not greater than the total length of the time series, the system determines a second portion of the time series, wherein the second portion has a second length and a second number of most recent entries equal to the first number scaled by the integer (operation 906). The system reduces the second number of entries based on the integer (operation 908) (i.e., downsampling as described above in relation to FIG. 7).
The characterization of the time series of an arbitrary length may thus be represented as a probability vector of size N*M, where N is the number of times that input data is passed through the signature detector to obtain an output probability vector of size M, and M is the number of pre-selected signatures (e.g., probability vector 780 of size N*M=4*3=12 of FIG. 7).
Content-processing system 1018 can include instructions, which when executed by computer system 1002, can cause computer system 1002 to perform methods and/or processes described in this disclosure. Specifically, content-processing system 1018 may include instructions for sending and/or receiving data packets to/from other network nodes across a computer network (communication module 1020). A data packet can include time series data, training data, synthetic data, a vector, a selected length, and selected signatures.
Content-processing system 1018 can further include instructions for determining one or more signatures, wherein a signature indicates a basis function for a time series of data (signature-selecting module 1022). Content-processing system 1018 can include instructions for training a neural network based on the signatures as a known output (network-training module 1024). Content-processing system 1018 can also include instructions for applying the trained neural network to the time series to generate a probability that the time series is characterized by a respective signature (probability-generating module 1026). Content-processing system 1018 can include instructions for enhancing an analysis of the time series data and the physical system based on the probability (time-series characterizing module 1028).
Furthermore, content-processing system 1018 can include instructions for applying the trained neural network to a first portion of the time series to generate, for each signature, a first probability that the time series is characterized by a respective signature, wherein the first portion has a first length and includes a first number of most recent entries of the time series (probability-generating module 1026). Content-processing system 1018 can include instructions for determining a second portion of the time series, wherein the second portion has a second length and includes a second number of most recent entries of the time series (time-series characterizing module 1028). Content-processing system 1018 can also include instructions for reducing the second number of entries (down-sampling module 1030). Content-processing system 1018 can include instructions for applying the trained neural network to the reduced second number of entries to generate, for each signature, a second probability that the time series is characterized by a respective signature (probability-generating module 1026). Content-processing system 1018 can additionally include instructions for characterizing the time series based on the first probability and the second probability (time-series characterizing module 1028).
Data 1032 can include any data that is required as input or that is generated as output by the methods and/or processes described in this disclosure. Specifically, data 1032 can store at least: a signature; a basis function; a time series; a length; a neural network; a recurrent neural network; a trained network; an input; an input time series; an output; an output vector; a one-hot vector; an index; a probability vector; a positive integer; a portion of a time series; a probability vector which indicates a probability for each signature that a time series is characterized by a respective signature; a most recent number of entries of a time series; a reduced number of entries; a downsampling algorithm; data generated from signatures; a relative proportion or weight for each signature that the time series is characterized by a respective signature; and a comparison of one signature to multiple signatures.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing code and/or data now known or later developed.
The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium.
Furthermore, the methods and processes described above can be included in hardware modules or apparatus. The hardware modules or apparatus can include, but are not limited to, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), dedicated or shared processors that execute a particular software module or a piece of code at a particular time, and other programmable-logic devices now known or later developed. When the hardware modules or apparatus are activated, they perform the methods and processes included within them.
The foregoing descriptions of embodiments of the present invention have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention. The scope of the present invention is defined by the appended claims.