The technology described in this patent document relates generally to data analysis techniques for transactional databases.
Transactional databases typically collect large amounts of time-stamped data relating to an organization's suppliers or customers over time. Examples of such transactional databases include web sites, point-of-sale (POS) systems, call centers, inventory systems, and others. Data mining techniques are often used to derive knowledge from such transactional databases. However, the size of each set of transactional data may be quite large, making it difficult to perform many traditional data mining tasks.
In accordance with the teachings described herein, systems and methods are provided for analyzing transactional data. A similarity analysis program may be used that receives time-series data relating to transactions of an organization and performs a similarity analysis of the time-series data to generate a similarity matrix. A data reduction program may be used that receives the time-series data and performs one or more dimension reduction operations on the time-series data to generate reduced time-series data. A distance analysis program may be used that performs a distance analysis using the similarity matrix and the reduced time-series data to generate a distance matrix. A data analysis program may be used that performs a data analysis operation, such as a data mining operation, using the distance matrix to generate a data mining analysis of the transactional data.
The time-series data 40 is input to both the data reduction block 32 and the similarity analysis block 34. The time-series data 40 is made up of transactional data that is stored with some indication of time (referred to herein as “time-stamped data”) and that is accumulated over time at a particular frequency. Some examples of time-series data include web sites visited per hour, sales per month, inventory draws per week, calls per day, trades per weekday, etc.
The data reduction block 32 performs one or more dimension reduction operations on the time-series data 40 to generate reduced time-series data 46. Traditional data mining techniques are typically applied to large data sets with observation vectors that are relatively small in dimension when compared to the length of a time series. In order to effectively apply these data mining techniques to a large number of series, the dimensions of each series may be reduced to a small number of statistics that capture its descriptive properties. Examples of dimension reduction operations that may be used, possibly in combination, to capture the descriptive properties of each time series include time domain analysis, frequency domain analysis, seasonal adjustment/decomposition and time-series modeling.
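By way of a non-limiting illustration, the following Python sketch reduces several series of differing lengths to a common, small set of time domain statistics; the particular feature set (mean, standard deviation, trend slope and lag-1 autocorrelation) is an assumption chosen for the example rather than a prescribed set.

```python
import numpy as np

def reduce_series(y):
    """Reduce a variable-length series to a fixed-length vector of
    descriptive statistics (an illustrative feature set)."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    slope = np.polyfit(t, y, 1)[0]             # linear trend (time domain)
    lag1 = np.corrcoef(y[:-1], y[1:])[0, 1]    # lag-1 autocorrelation
    return np.array([y.mean(), y.std(ddof=1), slope, lag1])

# Series of different lengths are all reduced to vectors of the same dimension.
series = [np.random.rand(52), np.random.rand(104), np.random.rand(24)]
reduced = np.vstack([reduce_series(y) for y in series])   # shape (N, 4)
```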
The similarity analysis block 34 performs a similarity analysis of the time-series data to generate a similarity matrix 48. In order to perform a similarity analysis, the component time series in the time-series data 40 are compared. Given two ordered numeric sequences (input and target), such as two time series, a similarity measure is a metric that measures the distance between the input and target sequences while taking into account the ordering. The similarity matrix 48 is generated by computing similarity measures between multiple time series.
The distance analysis block 36 performs a distance analysis using the similarity matrix 48 and the reduced time-series data 46 to generate a distance matrix 50. In order to generate the distance matrix 50, the reduced time-series data 46 and the similarity matrix 48 are combined into a data matrix with uniform dimensions, such as a series properties matrix. Statistical distances are then computed between vectors in the data matrix to generate the distance matrix 50.
The data mining block 38 performs one or more data mining operations using the distance matrix 50 to generate the data mining analysis 42. Numerous data mining techniques may be used by the data mining block 38 to generate the data mining analysis 42 with information that is useful for evaluating the time series data. Examples include sampling, clustering, classification and decision trees.
The accumulation block 102 receives a plurality of sets of time-stamped transactional data 114 and accumulates the time-stamped data 114 into a plurality of sets of time-series data 116. The accumulation of the time-stamped data 114 into time-series data 116 is based on a particular frequency. For example, time-stamped data 114 can be accumulated to form hourly, daily, weekly, monthly or yearly time series. Additionally, the method for accumulating the transactions within each time period is based on a particular statistic. For example, the sum, mean, median, minimum, maximum, standard deviation and/or other statistics can be used to accumulate the transactions within a particular time period.
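By way of a non-limiting illustration, the following Python sketch accumulates hypothetical time-stamped transactions into a monthly time series using the sum statistic; the column names and the pandas-based implementation are assumptions made for the example.

```python
import pandas as pd

# Hypothetical time-stamped transactional data (column names are assumptions).
transactions = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2006-01-03", "2006-01-17", "2006-02-02", "2006-02-20", "2006-03-05"]),
    "amount": [120.0, 80.0, 95.0, 40.0, 60.0],
})

# Accumulate the time-stamped data into a monthly time series using the sum
# statistic; mean, median, minimum, maximum, etc. could be used instead.
monthly = (transactions
           .set_index("timestamp")["amount"]
           .resample("MS")    # calendar-month frequency
           .sum())
```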
Many transactional and time series databases store data in longitudinal form, whereas many data mining software packages require the data to be in coordinate form. The dimension reduction operation(s) performed by the dimension reduction block 104 may extract features of the longitudinal dimension of the series and store the reduced sequence in coordinate form of fixed dimension. For instance, assume that there are N series with lengths {T_1, …, T_N}. In longitudinal form, each variable (or column) represents a single series, and each variable observation (or row) represents the series value recorded at a particular time. Notice that the length of each series, T_i, can vary: Y_i = {y_{i,t}}_{t=1}^{T_i} for i = 1, …, N, where Y_i is (T_i×1). This form is convenient for time series analysis but less desirable for data mining.
In coordinate form, each observation (or row) represents a single reduced sequence, and each variable (or column) represents the reduced sequence value. Notice that the length of each reduced sequence, M, is fixed: R_i = {r_{i,m}}_{m=1}^{M} for i = 1, …, N, where R_i is (1×M). This form is convenient for data mining but less desirable for time series analysis.
To reduce a single series, a univariate reduction transformation may be used to map the varying longitudinal dimension to the fixed coordinate dimension: R_i = F_i[Y_i] for i = 1, …, N, where R_i is (1×M), Y_i is (T_i×1), and F_i[ ] is the reduction transformation (e.g., seasonal decomposition). For multivariate series reduction, more than one series is reduced to a single reduction sequence. The bivariate case may be expressed as: R_i = F_i[Y_i, X_i] for i = 1, …, N, where R_i is (1×M), Y_i is (T_i×1), X_i is (T_i×1), and F_i[ ] is the reduction transformation (e.g., cross-correlations). It should be understood that the reduction transformation, F_i[ ], is indexed by the series index, i = 1, …, N, but typically it does not vary and thus is assumed to be the same, that is, F[ ] = F_i[ ].
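As a simplified illustration of such a reduction transformation F[ ], the following Python sketch maps series of varying lengths T_i (longitudinal form) into fixed-length rows R_i (coordinate form) using crude seasonal indices; the index computation is only a stand-in for a full seasonal decomposition.

```python
import numpy as np

def seasonal_index_reduction(y, period=4):
    """F[Y_i]: map a series of arbitrary length T_i to a fixed (1 x M) row of
    M = period seasonal indices (a simplified stand-in for seasonal decomposition)."""
    y = np.asarray(y, dtype=float)
    overall = y.mean()
    # Average the observations that fall in each season, then normalize.
    return np.array([y[s::period].mean() for s in range(period)]) / overall

# Longitudinal form: N series whose lengths T_i can differ.
Y = [np.random.rand(16), np.random.rand(24), np.random.rand(12)]

# Coordinate form: each row is one reduced sequence R_i of fixed length M.
R = np.vstack([seasonal_index_reduction(y) for y in Y])   # shape (N, M)
```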
Tables 1 and 2, set forth below, illustrate an example of the above formulas in tabular form. In the tables, a single period ‘.’ refers to a missing value, and three consecutive periods ‘ . . . ’ refer to continuation.
In the above example, dimension reduction transforms the series table (Table 1), which is (T×N), into the reduced table (Table 2), which is (N×M), where T = max{T_1, …, T_N} and where typically M < T. The number of series, N, can be quite large; therefore, even a simple reduction transform requires the manipulation of a large amount of data. Hence, the data should be put into the proper format to avoid having to post-process large data sets.
As described above, transactional and time series analysis can reduce a single transactional or time series to a relatively small number of descriptive statistics. In addition, the reduced data can be combined or merged with other categorical data (e.g., age, gender, income, etc.). For example, suppose that the rather large transaction history of a single customer is reduced to a small number of statistics using seasonal decomposition; the seasonal indices can then be combined with the customer's income and gender. This combined data set can be analyzed using both the seasonal indices and the categorical data. Such an analysis may be difficult or impossible if the high-dimension series data were used directly.
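By way of a non-limiting illustration, the following Python sketch merges hypothetical per-customer seasonal indices with categorical demographic data; all identifiers and values are assumptions made for the example.

```python
import pandas as pd

# Reduced transaction histories: one row of seasonal indices per customer
# (customer ids and values are hypothetical).
reduced = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "q1_index": [0.8, 1.1, 0.9],
    "q2_index": [1.2, 0.9, 1.0],
    "q3_index": [1.0, 1.0, 1.1],
    "q4_index": [1.0, 1.0, 1.0],
})

# Categorical/demographic data for the same customers.
demographics = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "gender": ["F", "M", "F"],
    "income": [52000, 67000, 43000],
})

# Combined data set that can be analyzed with both the seasonal indices
# and the categorical variables.
combined = reduced.merge(demographics, on="customer_id")
```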
Given sets of time-series data 116, a specified similarity analysis technique, Sim( ), may be used to measure the similarity between each pair of series: s_{i,j} = Sim(Y_i, Y_j), where Sim( ) measures the similarity between the ith and jth series. The resulting similarity matrix, S = {s_{i,j}} for i, j = 1, …, N, has uniform dimensions (N×N), which are needed for many data mining techniques. There are many known similarity analysis techniques that may be used in this manner to generate a similarity matrix 122. For example, s_{i,j} may represent the dynamic time warping similarity measure between the ith and jth products. In another example, s_{i,j} may represent the derivative dynamic time warping measure between the ith and jth products. In yet another example, s_{i,j} may represent the longest common subsequence similarity measure between the ith and jth products.
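As one non-limiting illustration of a dynamic time warping measure, the following Python sketch uses a basic, unconstrained dynamic programming formulation to fill an (N×N) matrix of pairwise measures; the absolute-difference point cost and the absence of a warping window are simplifying assumptions and do not represent any particular dynamic time warping variant.

```python
import numpy as np

def dtw_distance(a, b):
    """Basic dynamic time warping measure between two series (no window constraint)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Build the (N x N) similarity matrix S; the lengths of the series may differ.
series = [np.sin(np.linspace(0, 6, 40)),
          np.sin(np.linspace(0, 6, 55) + 0.3),
          np.cos(np.linspace(0, 6, 30))]
N = len(series)
S = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        S[i, j] = dtw_distance(series[i], series[j])
```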
Given the series properties vector (Z^P) for each series, the distance measure block 110 can compute statistical distances between the series properties vectors using some statistical distance measure to generate a distance matrix 131. The distance measure may, for example, be specified by an analyst. In this manner, the statistical distances may be calculated as follows: d_{i,j} = D(z_i^P, z_j^P), where D( ) measures the distance between the ith and jth series properties vectors. The statistical distances may, for example, be calculated using the SAS/STAT® software sold by SAS Institute Inc. of Cary, N.C.
In this manner, some or all of the series properties may be used to compute the distances. For example, d_{i,j} could represent the distance between diffusion parameters of the ith and jth products. In other examples, d_{i,j} could represent the distance between growth parameters, exponential decay parameters, dynamic time warping similarity measures, derivative dynamic time warping similarity measures, or other distance measures. A time-independent distance matrix, D, associated with all of the time series and having dimensions (N×N) may then be calculated as follows: D = {d_i}_{i=1}^{N}, where d_i = {d_{i,j}}_{j=1}^{N} is the distance vector associated with the ith series.
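By way of a non-limiting illustration, the following Python sketch assembles a hypothetical series properties matrix from reduced data and a similarity matrix and computes pairwise statistical distances; the Euclidean distance is used only as one possible choice of distance measure, and the random values stand in for actual series properties.

```python
import numpy as np

# Hypothetical series properties matrix Z^P: each row concatenates a series'
# reduced statistics with its row of the similarity matrix.
rng = np.random.default_rng(0)
N, M = 5, 4
reduced = rng.random((N, M))            # reduced time-series data (N x M)
similarity = rng.random((N, N))         # similarity matrix (N x N)
Z = np.hstack([reduced, similarity])    # series properties matrix (N x (M + N))

# Distance matrix D: d_ij = D(z_i, z_j); Euclidean distance is used here as
# one possible statistical distance measure.
diff = Z[:, None, :] - Z[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))   # shape (N, N)
```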
Another example data mining operation that may be performed using the data mining block 112 is a cluster analysis. A cluster analysis of the transactional data may be performed to place the transactional data objects into groups or clusters suggested by the data, such that objects in a given cluster tend to be similar and objects in different clusters tend to be dissimilar. Depending on the clustering algorithm being employed, coordinate data, distance data or a correlation or covariance matrix can be used in the cluster analysis. Given a set of reduced (coordinate) data, clustering analysis groups the rows of the reduced data. Because each reduced matrix row, R_i, uniquely maps to one series, Y_i, the series are indirectly clustered. Likewise, given a distance matrix, clustering analysis groups the rows of the distance matrix. Because each distance matrix row, D_i, uniquely maps to one series, Y_i, the series are also indirectly clustered.
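As a non-limiting illustration of clustering from a precomputed distance matrix, the following Python sketch applies agglomerative (hierarchical) clustering using SciPy; the use of SciPy, average linkage and a two-cluster cut are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# D is a symmetric (N x N) distance matrix with a zero diagonal
# (a small hypothetical example for three series).
D = np.array([[0.0, 1.2, 4.0],
              [1.2, 0.0, 3.5],
              [4.0, 3.5, 0.0]])

# Hierarchical clustering on the condensed form of the distance matrix.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cluster label per series
```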
Another example data mining operation that may be performed using the data mining block 112 is a sampling analysis. A sampling analysis of the transactional data may be performed to extract samples from the data. For instance, in order to efficiently explore, model and assess model results for the transactional data, random samples may be extracted from the database. The sample should be large enough to contain the significant information, yet small enough to process. In addition, for some data mining techniques, such as neural networks, more than one sample may be needed (e.g., training, testing and validation samples). Once a model has been assessed or evaluated for effectiveness, the model can then be applied to the entire database.
There are various ways to sample data. For instance, observation sampling may be more appropriate for reduced (coordinate) data, and longitudinal sampling may be more appropriate for series (longitudinal) data. To illustrate observation sampling, assume that there are N observations in the reduced (coordinate) data. Typically, for coordinate data, a random sample would be selected from the N observations; that is, N_SAMPLE random integers would be selected between one and N without replacement. The sample data would then be created from the observations whose observation index corresponds to one of the N_SAMPLE randomly selected integers. The sample data dimensions are (N_SAMPLE × M) with R_SAMPLE ⊂ {R_1, …, R_N}.
To illustrate longitudinal sampling, assume that there are N series with lengths {T_1, …, T_N}. Then, the total number of observations is N_OBS_TOTAL = T_1 + … + T_N. Using observation sampling, a random sample would be selected from the N_OBS_TOTAL observations; that is, N_OBS_SAMPLE random integers would be selected between one and N_OBS_TOTAL without replacement. The sample data would then be created from the observations whose observation index corresponds to one of the N_OBS_SAMPLE randomly selected integers. However, for series data, observation sampling is inadequate because the ability to exploit the relationship between observations within a single series is lost.
For series (longitudinal) data, a random sample should instead be selected from the N series; that is, N_SAMPLE random integers would be selected between one and N without replacement. The sample data would then be created from the series whose series index corresponds to one of the N_SAMPLE randomly selected integers. For series data, this longitudinal sampling is more appropriate, and the sample data dimensions are (N_SAMPLE × T) with Y_SAMPLE ⊂ {Y_1, …, Y_N}. For multivariate series analysis, all of the covariate series should be randomly selected jointly.
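The following Python sketch contrasts the two approaches: observation sampling of coordinate (reduced) data versus longitudinal sampling of whole series. The data sizes and the NumPy-based implementation are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Coordinate (reduced) data: sample N_SAMPLE of the N observation rows.
R = rng.random((100, 6))                                    # (N x M) reduced data
rows = rng.choice(R.shape[0], size=10, replace=False)      # observation sampling
R_sample = R[rows]                                          # (N_SAMPLE x M)

# Longitudinal (series) data: sample whole series, never individual observations,
# so the ordering within each sampled series is preserved.
Y = [rng.random(rng.integers(20, 60)) for _ in range(100)]  # N series, varying T_i
idx = rng.choice(len(Y), size=10, replace=False)            # longitudinal sampling
Y_sample = [Y[i] for i in idx]                              # N_SAMPLE complete series
```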
Fitted time series models can be used to forecast time series. These forecasts can be used to predict future observations as well as to monitor more recent observations for anomalies using holdout sample analysis. For example, after fitting a time series model to the time series data with a holdout sample excluded, the fitted model can be used to forecast within the holdout region. Actual values in the holdout sample that are significantly different from the forecasts could be considered anomalies. Some example statistics that can be used in a holdout sample analysis include performance statistics (e.g., RMSE, MAPE, etc.), prediction errors (e.g., absolute prediction errors that are three times prediction standard errors), confidence limits (e.g., actual values outside confidence limits), and many others.
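By way of a non-limiting illustration, the following Python sketch fits a simple linear-trend model (a stand-in for whatever time series model has been fitted) to a series with a holdout sample excluded, forecasts the holdout region, and flags actual values whose prediction errors exceed three times the in-sample standard error; the model choice and threshold are assumptions made for the example.

```python
import numpy as np

def holdout_anomalies(y, holdout=6, k=3.0):
    """Fit a simple linear-trend model to the series with the last `holdout`
    points excluded, forecast the holdout region, and flag actual values whose
    absolute prediction errors exceed k times the in-sample standard error."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    train_t, train_y = t[:-holdout], y[:-holdout]
    slope, intercept = np.polyfit(train_t, train_y, 1)
    fitted = intercept + slope * train_t
    sigma = (train_y - fitted).std(ddof=1)        # in-sample prediction std error
    forecast = intercept + slope * t[-holdout:]
    errors = y[-holdout:] - forecast
    return np.abs(errors) > k * sigma             # True where the actual is anomalous

# Example: a trending series with one injected anomaly in the holdout region.
y = 10 + 0.5 * np.arange(60) + np.random.normal(0, 1, 60)
y[-3] += 15
flags = holdout_anomalies(y)
```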
This written description uses examples to disclose the invention, including the best mode, and also to enable a person skilled in the art to make and use the invention. The patentable scope of the invention may include other examples that occur to those skilled in the art.
It is further noted that the systems and methods described herein may be implemented on various types of computer architectures, such as for example on a single general purpose computer or workstation, or on a networked system, or in a client-server configuration, or in an application service provider configuration.
It is further noted that the systems and methods may include data signals conveyed via networks (e.g., local area network, wide area network, internet, etc.), fiber optic medium, carrier waves, wireless networks, etc. for communication with one or more data processing devices. The data signals can carry any or all of the data disclosed herein that is provided to or from a device.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform methods described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The data corresponding to the systems and methods (e.g., associations, mappings, etc.) may be stored and implemented in one or more different types of computer-implemented ways, such as different types of storage devices and programming constructs (e.g., data stores, RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The systems and methods may be provided on many different types of computer-readable media including computer storage mechanisms (e.g., CD-ROM, diskette, RAM, flash memory, computer's hard drive, etc.) that contain instructions for use in execution by a processor to perform the methods' operations and implement the systems described herein.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, a software function unit of code, an object (as in an object-oriented paradigm), an applet, in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
This application claims priority from U.S. Provisional Patent Application No. 60/789,862, titled “Systems and Methods for Mining Time Series Data,” filed on Apr. 6, 2006, the entirety of which is incorporated herein by reference.