The disclosure relates to data processing systems, including data mining systems. More particularly, the invention relates to monitoring, reporting, and anomaly detection systems in which logs and their parameters are the subject of analysis.
Computer logs are a useful source of information for monitoring the behavior of a computer or a system of computers. Logs (either in the form of log files or log data streams) are typically automatically generated text files listing timestamped computer hardware events, computer software events, or messages sent to or from a computer. In some cases, a system can generate a large number of log messages, distributed over several files or data streams, from multiple sources (different computers and/or different software applications). Therefore, computerized data mining methods are needed to analyze the records in these log files.
Due to the nature of the logs, being automatically generated by software, the records follow patterns defined by the application generating the logs—similar events will be described in similar grammar and with a common set of keywords. For some logs, such as Apache Web Server logs, these patterns are commonly known and well defined. However, many logs will not have patterns known to the log reader ahead of time. Therefore, computerized methods for analyzing logs must have the ability to parse and understand all types of log patterns by specifying rules for parsing any given log format. Typically, this is done by manually defining parsing rules. For some systems, this requires significant manual effort.
There is significant research on log clustering; however, it is mostly based on grouping logs into sets according to their similarity. This is useful for determining generic classes of logs, but it is not efficient for building descriptions of specific patterns.
US Patent Publication No. 2015/0154269A1 (filed as U.S. patent application Ser. No. 14/611,089) relates to formulating and refining field extraction rules that are used at query time on raw data with a late-binding schema. Specifically, it provides analysis tools and a wizard to allow a user without extensive programming experience or training to create one or more extraction rules that deliver data values from events in machine data. While this might make rulemaking easier, it is still a manual rulemaking system.
The systems and methods described herein analyze and parse logs and determine patterns automatically, even if the patterns are not well defined. These systems and methods cluster the logs, retaining the order of tokens and parameters, and expressing the identified log patterns in a trie, allowing for automatic matching of patterns for incoming logs.
A first embodiment of this disclosure includes a computer system including a processor and a datastore, the system comprising: a log processing engine connected to the datastore and configured to: collect logs from a plurality of applications; tokenize the logs; match each record of the logs, from their tokens, to a pattern in a stored trie, each pattern having a unique pattern ID; extract free parameters and metadata from the logs; and store the logs to the datastore as combinations of the pattern IDs, the free parameters, and the metadata.
A second embodiment of this disclosure includes a computer-based method for storing computer logs, the method comprising: collecting logs from a plurality of applications; tokenizing, by a processor, the logs; matching, by the processor, each record of the logs, from their tokens, to a pattern in a stored trie, each pattern having a unique pattern ID; extracting, by the processor, free parameters and metadata from the logs; and storing the logs to the datastore as combinations of the pattern IDs, the free parameters, and the metadata.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
Two problems faced when implementing such a system are storing all of the logs, which for systems with multiple processors and multiple applications can require significant storage space, and efficiently making queries or performing analytics on the resulting logs. Both of these issues can be addressed by utilizing trie data structures for the logs. A trie, also known as a “digital tree” or “radix tree” or “prefix tree”, is an ordered data structure that can be used to store an associative array of keys. The position of each node in the tree defines the key with which it is associated. Herein the positions of the nodes are identified by their “pattern IDs”, which are assigned as the nodes are created. Trie structures offer faster querying speeds than binary search trees in the worst case scenario.
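As an illustration of the trie described above, the following is a minimal sketch in Python, assuming that keys are sequences of log tokens and that pattern IDs come from a simple running counter. The class name `PatternTrie` and its methods are hypothetical and are not taken from the disclosure.

```python
class PatternTrie:
    """Token trie in which every node receives a pattern ID when it is created."""

    def __init__(self):
        self._next_id = 0  # assumption: IDs are a simple running counter
        self.root = self._new_node()

    def _new_node(self):
        node = {"pattern_id": self._next_id, "children": {}}
        self._next_id += 1
        return node

    def insert(self, tokens):
        """Walk the path for `tokens`, creating missing nodes; return the last node's ID."""
        node = self.root
        for token in tokens:
            if token not in node["children"]:
                node["children"][token] = self._new_node()
            node = node["children"][token]
        return node["pattern_id"]

    def lookup(self, tokens):
        """Return the pattern ID at the end of an existing path, or None."""
        node = self.root
        for token in tokens:
            node = node["children"].get(token)
            if node is None:
                return None
        return node["pattern_id"]


trie = PatternTrie()
login_id = trie.insert(["User", "logged", "in"])
logout_id = trie.insert(["User", "logged", "out"])  # shares the "User logged" prefix
print(login_id, logout_id, trie.lookup(["User", "logged", "in"]))
```

In this sketch the cost of a lookup grows with the number of tokens in the record rather than with the number of stored patterns, and records with a common prefix share nodes.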
However, there are now two possible sequences (i.e. patterns) that match the log record (330): (1) the sequence [300, 301, 302] (302) and (2) the sequence [310, 311, 312] (312). The sequences are not equal in terms of the datastore space they require. The first sequence (302) contains three wildcards (300, 301, and 302), so storing the record (330) as the first sequence (302) would require storing three parameters. In contrast, the second sequence (312) has no wildcards, so no parameters need to be stored: the entire record can be recovered using just the pattern ID (312) and the related metadata. Since the second pattern (312) has lower datastore requirements, the system can be structured to prefer the second pattern (312) over the first pattern (302) for storing the log. If the datastore requirements are equal for multiple patterns, the determination can be arbitrary or based on some other predefined criterion, such as selecting the first discovered pattern among all equally datastore-intensive patterns.
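The preference described above can be sketched as follows. This is a simplified Python illustration, not the disclosed implementation: the `Node` class, the `"*"` wildcard marker, and the sample tokens are assumptions, and the reference numerals 300 to 312 are reused as pattern IDs only to mirror the example. Any node reached after all tokens are consumed is treated as a valid pattern end.

```python
WILDCARD = "*"  # hypothetical marker for a wildcard node


class Node:
    def __init__(self, pattern_id, token):
        self.pattern_id = pattern_id
        self.token = token
        self.children = []

    def add(self, pattern_id, token):
        child = Node(pattern_id, token)
        self.children.append(child)
        return child


def all_matches(node, tokens, params=()):
    """Yield (pattern_id, params) for every path that consumes all tokens."""
    if not tokens:
        yield node.pattern_id, list(params)
        return
    head, rest = tokens[0], tokens[1:]
    for child in node.children:
        if child.token == WILDCARD:
            # A wildcard accepts any token, which must then be stored as a parameter.
            yield from all_matches(child, rest, params + (head,))
        elif child.token == head:
            yield from all_matches(child, rest, params)


def best_match(root, tokens):
    """Prefer the match that stores the fewest parameters; ties go to the first found."""
    matches = list(all_matches(root, tokens))
    return min(matches, key=lambda m: len(m[1])) if matches else None


root = Node(0, None)
root.add(300, WILDCARD).add(301, WILDCARD).add(302, WILDCARD)
root.add(310, "Service").add(311, "started").add(312, "successfully")

print(best_match(root, ["Service", "started", "successfully"]))
print(best_match(root, ["Disk", "is", "full"]))
```

The first record matches both sequences and is stored under pattern 312 with no parameters; the second matches only the all-wildcard sequence and is stored under pattern 302 with three parameters.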
Because the number of alternative matching paths through the trie might be significant, measures can be provided to limit the maximum scope of the search. One possible solution is a windowed approach. Each window could hold the “match-any” path (so that at least one pattern is always available) and the N best matches found so far. With N equal to 256, at least eight levels can be considered exhaustively, because each node offers at most two alternatives (a wildcard match or an exact match) and 2^8 = 256. The system can start with a limited set of log patterns. It can contain some pre-trained patterns for expediency, but in the minimal case it can contain only sets of “match-any” patterns (i.e. sequences of only wildcard nodes), from which the specific patterns might then be trained by pattern discovery.
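One way to read the windowed approach is as a level-by-level search that keeps a bounded set of candidates. The Python sketch below is written under that assumption: candidates are ranked by the number of wildcards used so far, and the all-wildcard path is always retained. The function and field names are hypothetical.

```python
WILDCARD = "*"  # hypothetical marker for a wildcard node


def make_node(pattern_id, token):
    return {"id": pattern_id, "token": token, "children": []}


def add_child(parent, pattern_id, token):
    child = make_node(pattern_id, token)
    parent["children"].append(child)
    return child


def windowed_match(root, tokens, n_best=256):
    """Level-by-level search keeping the n_best candidates plus the match-any path."""
    window = [(0, root, [])]  # (wildcards used, node, extracted parameters)
    for depth, token in enumerate(tokens, start=1):
        expanded = []
        for used, node, params in window:
            for child in node["children"]:
                if child["token"] == WILDCARD:
                    expanded.append((used + 1, child, params + [token]))
                elif child["token"] == token:
                    expanded.append((used, child, params))
        if not expanded:
            return None  # no stored pattern covers this record
        expanded.sort(key=lambda cand: cand[0])  # stable: ties keep discovery order
        window = expanded[:n_best]
        # The all-wildcard candidate sorts last; re-add it if the cut dropped it.
        fallback = next((c for c in expanded[n_best:] if c[0] == depth), None)
        if fallback is not None:
            window.append(fallback)
    used, node, params = window[0]
    return node["id"], params


root = make_node(0, None)
n300 = add_child(root, 300, WILDCARD)
add_child(add_child(n300, 301, WILDCARD), 302, WILDCARD)
n310 = add_child(root, 310, "Service")
add_child(add_child(n310, 311, "started"), 312, "successfully")

print(windowed_match(root, ["Service", "started", "successfully"], n_best=1))
print(windowed_match(root, ["Disk", "is", "full"], n_best=1))
```

Even with a window of one candidate, the retained match-any path means that a record with no exact match still resolves to a pattern.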
This is presented in more detail in
The data involving log patterns can be saved to three separate stores (650, 660, 670). The first store can be an in-memory database of log tries (650), which can also be persisted to disk storage. The trie database should not normally require a significant amount of memory and is frequently accessed during the matching process, so fast memory access is preferred. The trie database structure is changed only during the discovery process. The second store (660) contains specific pattern occurrences together with their metadata. The last datastore (670) contains the free parameters. Stores 660 and 670 can effectively use column-oriented datastores, such as Apache Parquet™ or Apache Kudu™.
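The division of data among the three stores can be illustrated with a small in-memory sketch in Python. The row layouts, the `occurrence_id` key linking stores 660 and 670, the flat dictionary standing in for the trie store, and the whitespace join used for reconstruction are all assumptions made for illustration; a real deployment would write the second and third stores to a columnar format.

```python
from dataclasses import dataclass, field


@dataclass
class LogStores:
    # Stand-in for the trie store (650): pattern_id -> token template.
    patterns: dict = field(default_factory=dict)
    # Stand-in for store 660: (occurrence_id, pattern_id, metadata) rows.
    occurrences: list = field(default_factory=list)
    # Stand-in for store 670: (occurrence_id, position, value) rows.
    parameters: list = field(default_factory=list)

    def save(self, pattern_id, params, metadata):
        """Record one log occurrence and its free parameters."""
        occurrence_id = len(self.occurrences)
        self.occurrences.append((occurrence_id, pattern_id, metadata))
        for position, value in enumerate(params):
            self.parameters.append((occurrence_id, position, value))
        return occurrence_id

    def restore(self, occurrence_id):
        """Rebuild the log text from the pattern template and stored parameters."""
        _, pattern_id, _ = self.occurrences[occurrence_id]
        values = iter(v for oid, _, v in self.parameters if oid == occurrence_id)
        template = self.patterns[pattern_id]
        return " ".join(next(values) if tok == "*" else tok for tok in template)


stores = LogStores()
stores.patterns[7] = ["User", "*", "logged", "in", "from", "*"]
oid = stores.save(7, ["alice", "10.0.0.5"], {"host": "web-1"})
print(stores.restore(oid))
```

Only the pattern ID, the two parameter values, and the metadata are stored per occurrence; the fixed tokens live once in the pattern store.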
Storing data in a trie format not only reduces the storage requirement (thus also making seek times faster), it also allows the use of new data analysis approaches, as each pattern might be considered a specific kind of event. A sample chart is presented in
An example of query execution is presented in
For this example, the timestamp is designated by a special node (1010) that accepts any timestamp—like a wildcard, but with format limitations. In some embodiments, there may be several special case tokens and, therefore, several specific types of nodes for those special cases. One example is a timestamp. The timestamp might be constituted by characters that would normally be extracted to more than one token; however the system can be made aware of several common timestamp formats and consider this as a special type of token (and a special type of node) such that the entire timestamp is extracted to one token. An alternative could be to parse the timestamp from the log, associate the timestamp with log metadata, and remove the timestamp characters from the actual saved log content.
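The timestamp handling described above can be sketched in Python as follows. The two regular expressions, the `<TIMESTAMP>` placeholder token, and the function name are assumptions chosen for illustration; the disclosure does not enumerate the supported timestamp formats.

```python
import re

# Hypothetical list of known timestamp formats; a real system would cover more.
TIMESTAMP_FORMATS = [
    re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),   # e.g. 2016-08-15 12:34:56
    re.compile(r"[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}"),  # e.g. Aug 15 12:34:56
]


def tokenize_with_timestamp(line):
    """Return (tokens, timestamp); a leading timestamp becomes one special token."""
    for fmt in TIMESTAMP_FORMATS:
        match = fmt.match(line)
        if match:
            rest = line[match.end():].split()
            return ["<TIMESTAMP>"] + rest, match.group(0)
    return line.split(), None  # no recognized timestamp: plain whitespace tokens


print(tokenize_with_timestamp("2016-08-15 12:34:56 Service started"))
print(tokenize_with_timestamp("Aug 15 12:34:56 sshd session opened"))
```

Without the special case, whitespace tokenization would split the first timestamp into two tokens and the second into three. Returning the matched text separately also supports the alternative of keeping the timestamp only in the log metadata.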
Similarly, in some embodiments there can be options for handling special characters. One example is shown with the quotation marks as used in the example log for
The method with which the tokenization process is run will have a large impact on the number of parameters and their length in the trie structure. There are several possible methods.
In some embodiments, a more advanced approach to the tokenization process can include more special characters (such as / \ ' " { } [ ] : , !) as log string delimiters. In such an approach, the tokenization process will still be well defined. Tokens can sometimes be much shorter, and if so, the number of possible tokens will be larger. This is important because the memory requirements of the trie log structure will vary significantly depending on the choice of delimiter characters.
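The effect of the delimiter choice can be seen in a short Python comparison. The delimiter set and the sample log line below are illustrative assumptions, not the exact set used by the described system.

```python
import re

# Illustrative delimiter set, used in addition to whitespace.
EXTRA_DELIMITERS = "/\\'\"{}[]:,!"
_EXTENDED_SPLIT = re.compile("[" + re.escape(EXTRA_DELIMITERS) + r"\s]+")


def tokenize_whitespace(line):
    """Basic tokenizer: split on runs of whitespace only."""
    return line.split()


def tokenize_extended(line):
    """Split on whitespace and on the extra delimiter characters."""
    return [tok for tok in _EXTENDED_SPLIT.split(line) if tok]


line = "GET /api/users/42 status:200"
print(tokenize_whitespace(line))
print(tokenize_extended(line))
```

For this line the whitespace tokenizer produces three tokens and the extended tokenizer produces six, so the corresponding trie path is twice as deep but a variable part such as `42` can be isolated as a single free parameter.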
Sometimes, for the best results regarding optimization and performance, it may be best to define more than one tokenizer. In some embodiments a message log can be categorized by an adaptive tokenizer (1301). The selector selects the best tokenizer (1310, 1320, 1330) for that particular log message, as seen in
In order to create proper patterns for such cases, it is very helpful to define types of nodes in the log trie structure. Possible types of nodes include:
A number of embodiments of the disclosure have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the present disclosure. Accordingly, other embodiments are within the scope of the following claims.
The examples set forth above are provided to those of ordinary skill in the art as a complete disclosure and description of how to make and use the embodiments of the disclosure, and are not intended to limit the scope of what the inventor/inventors regard as their disclosure.
Modifications of the above-described modes for carrying out the methods and systems herein disclosed that are obvious to persons of skill in the art are intended to be within the scope of the following claims. All patents and publications mentioned in the specification are indicative of the levels of skill of those skilled in the art to which the disclosure pertains. All references cited in this disclosure are incorporated by reference to the same extent as if each reference had been incorporated by reference in its entirety individually.
It is to be understood that the disclosure is not limited to particular methods or systems, which can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the content clearly dictates otherwise. The term “plurality” includes two or more referents unless the content clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosure pertains.
This application is a continuation of U.S. patent application Ser. No. 15/237,559, entitled SYSTEMS AND METHODS FOR TRIE-BASED AUTOMATED DISCOVERY OF PATTERNS IN COMPUTER LOGS, filed Aug. 15, 2016, which is incorporated herein by reference for all purposes.
Number | Name | Date | Kind |
---|---|---|---|
6614789 | Yazdani | Sep 2003 | B1 |
6785677 | Fritchman | Aug 2004 | B1 |
6970971 | Warkhede | Nov 2005 | B1 |
7929534 | Poletto | Apr 2011 | B2 |
8032489 | Villella | Oct 2011 | B2 |
8271403 | Rieck | Sep 2012 | B2 |
8380758 | Stephens | Feb 2013 | B1 |
8495429 | Fu | Jul 2013 | B2 |
8554907 | Chen | Oct 2013 | B1 |
8626696 | Lambov | Jan 2014 | B2 |
8630989 | Blohm | Jan 2014 | B2 |
8949251 | Thomas | Feb 2015 | B2 |
8949418 | Bray | Feb 2015 | B2 |
9075718 | Hinterbichler | Jul 2015 | B2 |
9166993 | Liu | Oct 2015 | B1 |
9479479 | Tulasi | Oct 2016 | B1 |
9497206 | Bernstein | Nov 2016 | B2 |
9552249 | James | Jan 2017 | B1 |
10021026 | Shen | Jul 2018 | B2 |
10038710 | Bersch | Jul 2018 | B2 |
10055439 | Futamura | Aug 2018 | B2 |
10866972 | Maciolek et al. | Dec 2020 | B2 |
20040133672 | Bhattacharya | Jul 2004 | A1 |
20080222706 | Renaud | Sep 2008 | A1 |
20100223499 | Panigrahy | Sep 2010 | A1 |
20140208412 | Bray | Jul 2014 | A1 |
20140369209 | Khurshid | Dec 2014 | A1 |
20150154269 | Miller | Jun 2015 | A1 |
20150222477 | Ranjan | Aug 2015 | A1 |
20150293920 | Kanjirathinkal | Oct 2015 | A1 |
20160182486 | Wu | Jun 2016 | A1 |
20160197952 | Fujimoto | Jul 2016 | A1 |
20160292599 | Andrews | Oct 2016 | A1 |
20170126534 | Cimino | May 2017 | A1 |
20170180403 | Mehta | Jun 2017 | A1 |
20170249200 | Mustafi | Aug 2017 | A1 |
20170324759 | Puri | Nov 2017 | A1 |
20180046697 | Maciolek et al. | Feb 2018 | A1 |
20190243835 | Johnson | Aug 2019 | A1 |
20200067912 | Block | Feb 2020 | A1 |
20210144089 | Narayanan | May 2021 | A1 |
Entry |
---|
“U.S. Appl. No. 15/237,559, Non-Final Office Action mailed Sep. 7, 2018”, 18 pgs. |
“U.S. Appl. No. 15/237,559, Response filed Mar. 6, 2019 to Non Final Office Action mailed Sep. 7, 2018”, 9 pgs. |
“U.S. Appl. No. 15/237,559, Final Office Action mailed May 14, 2019”, 11 pgs. |
“U.S. Appl. No. 15/237,559, Response filed Aug. 13, 2019 to Final Office Action mailed May 14, 2019”, 10 pgs. |
“U.S. Appl. No. 15/237,559, Non-Final Office Action mailed Oct. 1, 2019”, 12 pgs. |
“U.S. Appl. No. 15/237,559, Response filed Nov. 29, 2019 to Non-Final Office Action mailed Oct. 1, 2019”, 5 pgs. |
“U.S. Appl. No. 15/237,559, Final Office Action mailed Jan. 22, 2020”, 19 pgs. |
“U.S. Appl. No. 15/237,559, Response filed Jun. 22, 2020 to Final Office Action mailed Jan. 22, 2020”, 7 pgs. |
“U.S. Appl. No. 15/237,559, Notice of Allowance mailed Aug. 20, 2020”, 14 pgs. |
Number | Date | Country |
---|---|---|
20210081437 A1 | Mar 2021 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 15237559 | Aug 2016 | US |
Child | 17098170 | | US |