Field of the Invention
The field of the invention relates to a system and method for analysing data.
Brief Description of the Related Art
Many companies in a large variety of industries store large volumes of data, which are increasing in volume over time. These large volumes of data may include, but are not limited to, financial transaction data, computer network, infrastructure data, environmental data, operational data and social statistics. The data can be analysed or mined to identify trends, anomalies, and/or patterns in the data. The identified trends, anomalies or patterns can be used to understand and address particular problems. Current methods of data analysis are good at retrieving specific content about the data. For example, they are very efficient at retrieving all transactions for a specific user or host.
Known data analysis systems, such as Splunk, are able to store, index and utilize the large volumes of data. Other systems are known that create complex data models from the data, for example a system supplied by Prelert, Inc., Framingham, Mass. These data models enable users to glean insights, such as anomalies or trends, from historical and newly ingested data.
The known systems generate the information by a complete re-analysis of the incoming data. This re-analysis of the data can take a large amount of time. For example, it is common to analyse the data overnight in a batch process, when a processor is not heavily used. This re-analysis of the data is suitable for the identification of trends in the data for which no immediate action has to be taken. On the other hand, if the results are required in real time, for example because of a change in a trend or a series of anomalous results, then these prior art methods do not provide the information sufficiently quickly.
One example of a prior art method for the analysis of data is disclosed in the U.S. Pat. No. 8,832,120 issued Sep. 9, 2014. This patent document teaches a computer-based method of determining a so-called weirdness score for variables within a large data set.
A method for analysing data is disclosed. One or more data records are passed to a data analysis system. The data records comprise a plurality of data items. In the case of anomaly detection, the value of one or more of the data items in one or more of the data records is compared with an expected value derived from a statistical model. The statistical model is derived from previous data records. On identifying an abnormal value, i.e. a value that falls substantially outside of the range of expected values, the data model can be updated to indicate an anomaly. The statistical model is updated from the earlier statistical model using the passed data records. The statistical models are persisted for use as more data items are analysed, and the data models are persisted as a database of insights into the original data records.
The method of this disclosure enables statistical models to be quickly and efficiently updated by using the previously calculated statistical model and updating the statistical models with the new data records. The resulting data model (containing insights such as anomalies) can be stored with the associated data records in the data base, which allows the data records to be readily accessed if required. For example, the user or supervisor might receive the message about the anomaly and wish to review the associated data records.
In the case of anomaly detection, the expected values could be described by a normal distribution parameterised by the mean and variance of some of the values of data items, an indication of the class of membership of one or more of the data items or membership of a cluster, or a periodic trend. These examples are not limiting of the invention. In one aspect of the invention, the data model is indexed to allow rapid retrieval.
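The incremental update of a normal-distribution model and the comparison against expected values can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the class name `RunningNormal`, the use of Welford's online algorithm and the threshold of three standard deviations are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningNormal:
    """Hypothetical normal model updated one value at a time (Welford's algorithm)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Fold one new value into the earlier model; no history is re-read.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; zero until at least two values have been seen.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

    def is_abnormal(self, value: float, threshold: float = 3.0) -> bool:
        # Assumed rule: abnormal if more than `threshold` standard deviations away.
        std = math.sqrt(self.variance)
        if self.count < 2 or std == 0.0:
            return False
        return abs(value - self.mean) / std > threshold


model = RunningNormal()
for v in [10.0, 11.0, 9.0, 10.5, 9.5]:
    model.update(v)
```

With these illustrative values, `model.is_abnormal(20.0)` flags the value while `model.is_abnormal(10.2)` does not.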
The disclosure also teaches a system for analysing the data that comprises at least one data entry device for the ingestion of at least one data record. A data analysis system accepts at least one of the plurality of data items from the data records and compares the value of the accepted one of the plurality of data items with an expected value. The data analysis system also updates a statistical model using the accepted one of the plurality of data items. An entry can be written to the data model if the value is abnormal or extraneous, i.e. lies outside a range of expected values. This entry can be reviewed by a user or administrator and the incident investigated. This forensic investigation can also use entries from the data model. The updated statistical model is stored in a store.
Still other aspects, features, and advantages of the present invention are readily apparent from the following detailed description simply by illustrating preferable embodiments and implementations. The present invention is also capable of other and different embodiments and its several details can be modified in various obvious respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and descriptions are to be regarded as illustrative in nature, and not as restrictive. Additional objects and advantages of the invention will be set forth in part in the description which follows and in part will be obvious from the description, or may be learned by practice of the invention.
For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following description and the accompanying drawings, in which:
The invention will now be described on the basis of the drawings. It will be understood that the embodiments and aspects of the invention described herein are merely examples and do not limit the protective scope of the claims in any way. The invention is defined by the claims and their equivalents. It will be understood that features of one aspect or embodiment of the invention can be combined with a feature of a different aspect or aspects and/or embodiments of the invention.
The input of the data records 30 is made by means of a data entry device 15. The data entry device 15 collects the data records 30 from a number of sources, including but not limited to, user entry devices, such as terminal 90, sensors measuring physical quantities, the internet, HTTP requests, IP (or similar) addresses and an intranet. The data entry device 15 passes the data records 30 to the data analysis system 20. It will be seen that the data records 30 comprise a plurality of data input items, which are collectively labelled with the reference numeral 35. Each one of the data items 35 can be individually processed. Examples of the data items 35 include, but are not limited to, timestamps and values of data.
In one aspect of the invention it is possible to aggregate the values of the data over time so that, instead of storing multiple data records 30, only a single data record with an average value is stored, as will be explained later.
This comparison step 210 will highlight any insights in the newly input data items 35.
The statistical model 50 starts off as an initial model that can be non-informative or can incorporate expert knowledge, for example that CPU usage ranges from 0 to 100%. The aim of the method is to develop the statistical model 50 such that the statistical model 50 identifies relationships between different ones of the data items 35 and the comparison step 210 can identify insights, such as anomalies in the data items 35, because one or more values of the data items 35 differ from the expected values or the relationships between the data items 35 are different. It will be appreciated that initially the variance in values of the data items 35 may be large. Over a short period of time, it is expected that the average range of values of each of the data items 35 is established and that the variance of the values decreases. For example, any daily variations in the values of the data items 35 should be identified within a few days, whereas monthly variations in the values of the data items 35 will take a few months. The relationships will be multi-dimensional and clustering of the data items 35 will be established, as shown in connection with
Should the comparison step 210 identify an “abnormal” value, i.e. a value lying substantially outside the range of expected values of one or more of the data items 35, then this abnormal value can be highlighted in one of the messages to the user or the supervisor in step 220. The user or supervisor uses this highlighted value to investigate the reasons for the abnormal value using the terminal. The user or supervisor can gain an insight into the data in which the abnormal value was identified.
The statistical model 50 is updated in update step 230 using the newly inputted data items 35 from input step 210. The updated statistical model 50 can be stored in the data base 60 together with the data records 30 with the data items 35 in storage step 240. This updating of the stored statistical model 50 in update step 230 happens in real time or could be initiated in a batch process, for example overnight when the system 10 has available processing capacity.
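One way to picture the persistence of the statistical model between updates is sketched below. It is a simplified illustration under stated assumptions: the model state is reduced to a count, a sum and a sum of squares, the data base is stood in for by a plain dictionary, and JSON is an assumed serialisation format. The names `update`, `mean`, `EMPTY` and `database` are hypothetical.

```python
import json


def update(state: dict, values: list) -> dict:
    """Return a new model state that folds `values` into the earlier state."""
    return {
        "n": state["n"] + len(values),
        "total": state["total"] + sum(values),
        "total_sq": state["total_sq"] + sum(v * v for v in values),
    }


def mean(state: dict) -> float:
    """Mean of all values seen so far."""
    return state["total"] / state["n"]


EMPTY = {"n": 0, "total": 0.0, "total_sq": 0.0}
database = {}  # stand-in for the data base; a real system would use durable storage

# First batch of data items: update the model and persist it.
database["model"] = json.dumps(update(EMPTY, [4.0, 6.0]))

# Later: restore the persisted model and update it with newly input items only.
restored = json.loads(database["model"])
database["model"] = json.dumps(update(restored, [8.0]))
resumed = json.loads(database["model"])

# For comparison: a complete re-analysis of every value in one pass.
one_pass = update(EMPTY, [4.0, 6.0, 8.0])
```

In this sketch the resumed model equals the model obtained from a complete re-analysis, which is why the earlier data records do not have to be read again. Sums of squares can lose precision on large data sets, so a production system would likely prefer a numerically stabler update.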
As noted above, not all of the data items 35 need to be stored in the data base 60 or processed by the data analysis system 20. Indeed, it is possible that the data items 35 are not stored in the system 10 at all, but are accessible from the system 10.
A data aggregator 70 can be used to aggregate or bucket together several of the data items 35 from different ones of the data records 30. For example, all of the values of one of the data items 35 could be averaged over a period of time. The data aggregator 70 would then provide the average value of the one of the data items 35 for use in the comparison step 210 and the update step 230 as well as for storage in the data base 60. This saves processing time and storage space. Additionally, the data items 35 coming from the same one of the data entry devices 15 could also be stored together or averaged, depending on the requirements of the system 10.
The statistical model 50 is self-learning. It is not created using a set of ‘training’ data that has been labelled. The data analysis system 20 uses the data items 35 to create a series of relationships between the various data items 35. The relationships could be temporal relationships, i.e. that one of the data items 35 takes particular values after a certain amount of time, could be averages or means with standard deviations, or could be examples of variances in the data. The statistical model 50 is continually updated by the newly ingested data records 30. So, for example, as a configuration of the system 10 changes, new data items 35 are entered and the statistical model 50 does not remain static, but is able to adjust its calculation based on the newly ingested data records 30.
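The bucketing performed by a data aggregator can be illustrated with the short sketch below. The function name `aggregate`, the representation of data items as `(timestamp, value)` pairs and the 60-second bucket width are assumptions made for the example.

```python
from collections import defaultdict


def aggregate(items, bucket_seconds):
    """Average (timestamp_seconds, value) pairs per time bucket.

    Returns a dict mapping each bucket start time to the mean value in that bucket.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [running total, count]
    for timestamp, value in items:
        start = (timestamp // bucket_seconds) * bucket_seconds
        sums[start][0] += value
        sums[start][1] += 1
    return {start: total / n for start, (total, n) in sorted(sums.items())}


# Hypothetical data items: five readings spread over a little more than two minutes.
items = [(0, 10.0), (30, 20.0), (65, 40.0), (119, 60.0), (120, 5.0)]
buckets = aggregate(items, 60)
```

Here five readings collapse into three averaged values, one per minute, which is the kind of reduction in processing and storage described above.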
In one aspect of the invention, the statistical model 50 can be used to predict future values and forecast events. It is also possible to calculate the probability of a particular event happening and then make a comparison after the event has happened.
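As an illustration of calculating the probability of an event from the statistical model, the sketch below assumes that the value of a data item follows a normal distribution with a known mean and standard deviation, and computes the probability that the value exceeds a threshold. The function name `tail_probability` and the numbers used are hypothetical.

```python
import math


def tail_probability(threshold: float, mean: float, std_dev: float) -> float:
    """P(X > threshold) for X ~ Normal(mean, std_dev**2)."""
    z = (threshold - mean) / (std_dev * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


# Assumed model: mean 10, standard deviation 2. Probability of exceeding 14.
p_exceed = tail_probability(14.0, mean=10.0, std_dev=2.0)
```

Once the event has happened, the observed outcome can be compared with this probability; an outcome that the model considered very unlikely is a candidate for further review.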
The system and method of the disclosure can be used to determine changes in the occurrence of events and values of the data. Suppose that one of the incoming data records 30 has data items 35 which are measured at a time (t) and have a value (V) of the data for the occurrence of a particular event E. The data items 35 have a timestamp associated with them, which has the value t.
The values V of the data are used to develop and update the statistical model 50 in the store 40. In this example, the user is interested in the number of events E over time as well as the average V over time. The rate of change of the value of the data as well as the change in the number of events E is recorded in the statistical model having been calculated from previous values of the data records 30. The mean of the values of the data, the running total average of the values of the data and/or the standard deviation of the value are stored in the statistical model 50. The direct storage of these values in the data base 60 means that these values no longer need to be re-calculated if the supervisor 80 wishes to review the patterns of the data. The supervisor 80 can merely interrogate the data base 60 to obtain the values of interest. The raw values of the data, i.e. the values V and the timestamp can also be stored as part of the data record 30.
A baseline for normal behaviour is calculated, which is reflected in the statistical model 50. So, if the rate of change of the data, the mean of the data or the standard deviation is within the baseline calculated by the statistical model 50, then the data analysis system 20 will merely store these values in the data base 60. The data analysis system 20 can generate a new data model 55 or update an existing data model 55 in step 220 to reflect any abnormal behaviour and issue a message (such as an alert) if the data analysis system 20 detects that any one of the values deviates abnormally from the baseline. This can be indicated to the supervisor 80 at the terminal 90. This abnormal deviation is stored in the data model 55.
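The baseline check can be sketched as follows. The representation of the baseline as a `(low, high)` range per statistic, the list used as the data model and the function name `check_baseline` are assumptions for illustration only.

```python
def check_baseline(observed: dict, baseline: dict, data_model: list) -> list:
    """Compare observed statistics with baseline ranges.

    Deviations are appended to `data_model`; alert messages are returned.
    """
    alerts = []
    for name, value in observed.items():
        low, high = baseline[name]
        if not (low <= value <= high):
            # Record the abnormal deviation so it can be investigated later.
            data_model.append({"statistic": name, "value": value, "expected": (low, high)})
            alerts.append(f"ALERT: {name}={value} outside [{low}, {high}]")
    return alerts


# Hypothetical baseline ranges and one set of newly calculated statistics.
baseline = {"mean": (40.0, 60.0), "std_dev": (0.0, 5.0), "rate_of_change": (-2.0, 2.0)}
data_model = []
alerts = check_baseline(
    {"mean": 52.0, "std_dev": 3.1, "rate_of_change": 7.5}, baseline, data_model
)
```

In this example only the rate of change lies outside its range, so one entry is written to the data model and one alert is produced.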
The supervisor 80 can use the data model 55 and directly access the data base 60 to see the stored data items 35 and review the previous updated statistical model and otherwise view or manipulate the data items 35.
A further use of the system and method of this disclosure is shown with respect to
One particular example which may be of interest to the supervisor 80 is shown by the arrow 320 on
A further use of the system and method would be in financial trading. The statistical model 50 represents in this case the profile of financial traders, and the transactions will be recorded in the data base 60 together with the statistical model 50 representing the trades. One indication of whether a trader is carrying out the trades or whether this has been done by an automated bot is the rate of trades. The “normal” trading rate is a statistic that can be calculated by the data analysis system 20 and forms part of the statistical model 50. Should the trading rate increase rapidly, or the trades come from a different IP address than expected, then these are abnormal values that are identified and recorded in the data model 55 to allow subsequent investigation.
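A check of this kind might look like the sketch below. The profile fields, the factor of five over the normal rate, the one-minute window and the function name `check_trader` are all assumptions; the IP addresses are taken from ranges reserved for documentation.

```python
def check_trader(trade_times, source_ip, profile, window_seconds=60.0, factor=5.0):
    """Return the reasons, if any, why the trades in one window look abnormal."""
    reasons = []
    rate_per_min = len(trade_times) / (window_seconds / 60.0)
    if rate_per_min > factor * profile["normal_rate_per_min"]:
        reasons.append("trading rate")
    if source_ip not in profile["expected_ips"]:
        reasons.append("source IP")
    return reasons


# Hypothetical trader profile held in the statistical model.
profile = {"normal_rate_per_min": 2.0, "expected_ips": {"192.0.2.10"}}

burst = [float(i) * 2.0 for i in range(30)]  # 30 trades within one minute
suspicious = check_trader(burst, "198.51.100.7", profile)
ordinary = check_trader([5.0, 20.0, 41.0], "192.0.2.10", profile)
```

The burst of thirty trades from an unexpected address is flagged on both counts, whereas three trades from the usual address are not flagged.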
The system and method could be used by a retail store to monitor purchases, stocks, revenue etc. In this example, the data aggregator 70 will be used to at least reduce the amount of storage required in the data base 60. The system and method are used to forecast and/or predict sales. Factors, such as holiday periods or weather patterns can be further stored and the statistical model 50 used to establish relationships. For example, the relationship between summer weather and purchase of barbecue sets. Any abnormalities in the sales are stored in the data model 55 and can be analysed for insights into the sales.
The type of relationships that can be established is dependent on the data ingested by the system in step 220. This will depend on the available data as well as the administrator's interests. To take Example 4, a relationship between summer weather and the purchase of barbecue sets is only possible if data relating to the summer weather (temperature, rainfall, etc.) is ingested as well as details of the sale of the barbecue sets.
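A simple way to quantify such a relationship, once both kinds of data have been ingested, is a correlation coefficient. The sketch below uses the Pearson coefficient as one assumed choice of measure; the temperature and sales figures are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented weekly figures: average temperature (deg C) and barbecue sets sold.
temperatures = [12.0, 18.0, 24.0, 30.0]
barbecue_sales = [3.0, 6.0, 9.0, 12.0]
r = pearson(temperatures, barbecue_sales)
```

A coefficient close to 1 indicates that sales rise with temperature in this invented data; without the ingested weather data no such relationship could be computed.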
The system and method can also enable forecasting or prediction of trends.
The foregoing description of the preferred embodiment of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. The embodiment was chosen and described in order to explain the principles of the invention and its practical application to enable one skilled in the art to utilize the invention in various embodiments as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims appended hereto, and their equivalents. The entirety of each of the aforementioned documents is incorporated by reference herein.
The present application claims the benefit of the filing date of U.S. Provisional Patent Application Ser. No. 62/000,609 filed by the present inventors on May 20, 2014 and entitled “Method and system for analysing data.” The aforementioned provisional patent application is hereby incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20150339600 A1 | Nov 2015 | US |
Number | Date | Country | |
---|---|---|---|
62000609 | May 2014 | US |