Enterprise storage environments designed for the large-scale, high-technology operations of modern enterprises typically involve the storage of large amounts of historical log data. The log data may be searched for occurrences of query information related to a search query. For example, the log data may be searched for the occurrence of a particular Internet protocol (IP) address, or a host name. The search query for the query information may include a time range associated therewith. For example, the search query may include a time range for the past ten minutes, the past six months, etc., associated therewith.
Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but not limited to, and the term “including” means including but not limited to. The term “based on” means based at least in part on.
In environments such as enterprise storage environments that involve the storage of large amounts of historical log data, the log data may be searched for the occurrence of query information related to a search query, for example, by checking each log message of the log data individually. The time and resource utilization for a search may be reduced, for example, by limiting the search to a time range. However, absent further elimination of log data that needs to be searched, any further reduction in the time and resource utilization related to the search may be limited.
According to examples, a bloom filter based log data analysis apparatus and a method for bloom filter based log data analysis are disclosed herein. The apparatus and method disclosed herein may provide for a search operation related to the log data to rule out data ranges of the log data that definitely do not contain the query information related to a search query through the use of bloom filters. The data ranges of the log data may be related, for example, to time-based ranges of the log data. For example, the data ranges of the log data may be based on log data from a ten minute range, a six hour range, etc., of the log data. Alternatively or additionally, the data ranges of the log data may be based on a number of log data messages associated with the log data, or other aspects that may be used to divide the log data as needed. Compared, for example, to the log data, a bloom filter may take up a relatively small amount of memory storage space. Further, a bloom filter may be checked relatively quickly to determine if the bloom filter contains a particular query information related to a search query.
The bloom filter may determine that a particular log data information (e.g., an IP address, host name, etc.) was probably added, with a quantifiable false positive rate. Further, the bloom filter may determine that a particular log data information was definitely not added, without any chance of a false negative result. By accepting the unneeded effort caused by an occasional false positive result from the bloom filter, search speeds related to searching of the log data may be increased for queries with few or no results, since large ranges of the log data may be ruled out by the bloom filters. Thus, by eliminating data ranges of the log data that definitely do not include any search results related to a search query, the apparatus and method disclosed herein may limit searching to ranges of the log data that are known, with a predetermined measure of certainty, to contain relevant results related to the query information. For queries with zero results, the overall search speed may be constant, since all of the log data may be ruled out as containing search results.
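As an illustrative sketch only, and not as the implementation of the apparatus disclosed herein, the "probably added" versus "definitely not added" behavior of a bloom filter may be shown as follows. The class name `BloomFilter`, the use of salted SHA-256 hashes, and the chosen size and number of hashes are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter; the size and hash count are illustrative assumptions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # One bit position per hash, obtained by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not added"; True means "probably added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


# Hypothetical log data information for one data range.
range_filter = BloomFilter(num_bits=8192, num_hashes=5)
for ip in ("10.0.0.1", "10.0.0.2", "192.168.1.7"):
    range_filter.add(ip)
```

In this sketch, every value that was added is always reported as possibly present, whereas a value that was never added is reported as absent unless all of its bit positions happen to collide with bits set by other values, which is the false positive case.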
The generation of the bloom filters as the log data is received may add a relatively small amount of overhead (i.e., bloom filter data) due to the typical nature of the log data being tracked. Further, the storage of the bloom filter data may be generally negligible in comparison to the storage of the log data. Therefore, with the use of the bloom filters, the apparatus and method disclosed herein may efficiently search the log data for query information.
Referring to
The pre-computed hash generation module 106 may ascertain information related to a longest storage group retention timeframe for a storage group including a predetermined number of the data ranges for the particular log data information 112, and generate the master bloom filter 114 based on the longest storage group retention timeframe. In this manner, the master bloom filter 114 may stay current as to a predetermined number of the data ranges for the particular log data information 112.
The pre-computed hash values 110 may be computed for each of the different data range based bloom filters 104 for each log data information 112 per data range of the log data information 112, and for the corresponding master bloom filter 114. Alternatively or additionally, the pre-computed hash values 110 computed for each of the different data range based bloom filters 104 for each log data information 112 per data range of the log data information 112 may be used to compute the pre-computed hash values 110 for the corresponding master bloom filter 114.
The pre-computed hash generation module 106 may support linear combinations of the pre-computed hash values. For example, instead of computing a hash a plurality of (e.g., fifteen) times, the hash may be computed twice and the two results combined to obtain the needed hash values for the data range based bloom filter 104 and/or the master bloom filter 114. For example, for an input x for a bloom filter of size m bits, two hash values for the input x may be computed, named h1 and h2. In order to derive all the needed k bloom filter hash values b1, b2, b3 . . . bk, bi=(h1+(i*h2)) mod m may be computed for each i from 1 to k.
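The linear combination described above may be sketched as follows. The function names `base_hashes` and `bit_positions`, the choice of SHA-256 as the source of h1 and h2, and the example sizes are assumptions for illustration only.

```python
import hashlib


def base_hashes(value: str) -> tuple[int, int]:
    """Compute the two base hash values h1 and h2 once per input."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return h1, h2


def bit_positions(h1: int, h2: int, k: int, m: int) -> list[int]:
    """Derive b_1 .. b_k as (h1 + i*h2) mod m for i = 1 .. k."""
    return [(h1 + i * h2) % m for i in range(1, k + 1)]


# The same pair (h1, h2) may be reused for filters of different sizes.
h1, h2 = base_hashes("10.0.0.1")
range_positions = bit_positions(h1, h2, k=15, m=8192)      # data range based filter
master_positions = bit_positions(h1, h2, k=15, m=1 << 20)  # larger master filter
```

Because only h1 and h2 depend on the input, storing this pair may allow the bit positions for both a data range based bloom filter and a differently sized master bloom filter to be derived without hashing the input again.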
Referring to
The query processing module 116 may first evaluate the pre-computed hash values 110 related to the log data information 112 for the master bloom filter 114. If the pre-computed hash values 110 related to the log data information 112 for the master bloom filter 114 indicate that the log data information 112 has not been received (i.e., the query information 120 is not present in the log data 108), the query processing module 116 may perform no further analysis of the pre-computed hash values 110, and report the results to a log message data analysis module 122.
If the pre-computed hash values 110 related to the log data information 112 for the master bloom filter 114 indicate that the log data information 112 may likely have been received (i.e., the query information 120 may likely be present in the log data 108), the query processing module 116 may further evaluate the pre-computed hash values 110 related to the log data information 112 for each of the different data range based bloom filters 104 for the specific data range specified in the query 118.
If the pre-computed hash values 110 related to the log data information 112 for all of the different data range based bloom filters 104 for the specific data range specified in the query 118 indicate that the log data information 112 has not been received (i.e., the query information 120 is not present in the log data 108 for the data ranges corresponding to the different data range based bloom filters 104), the query processing module 116 may report the results to the log message data analysis module 122.
Further, if the pre-computed hash values 110 related to the log data information 112 for any of the different data range based bloom filters 104 for the specific data range specified in the query 118 indicate that the log data information 112 may likely have been received (i.e., the query information 120 may likely be present in the log data 108 for the data ranges corresponding to the different data range based bloom filters 104), the query processing module 116 may report the results to the log message data analysis module 122.
The log message data analysis module 122 may further evaluate the log data 108 based on the determination by the query processing module 116. For example, based on the determination by the query processing module 116 that the query information 120 is likely to be present in the log data 108, the log message data analysis module 122 may further evaluate the log data 108 to confirm presence of the query information 120. For example, the log message data analysis module 122 may further evaluate the specific data ranges of the log data 108 where the query processing module 116 indicates presence of the query information 120 to confirm presence of the query information 120. For any data ranges of the log data 108 that are determined by the query processing module 116 to definitely not include the query information 120, these data ranges may be eliminated by the log message data analysis module 122 from further evaluation. Similarly, if the master bloom filter 114 is determined not to include the query information 120 by the query processing module 116, the log message data analysis module 122 may report results 124 of the analysis to a user of the bloom filter based log data analysis apparatus 100, without further analysis of any of the log data 108.
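The evaluation order described above (master bloom filter first, then the data range based bloom filters, then a scan of only the surviving data ranges) may be sketched as follows. The `Bloom` class, the `search` function, the whitespace tokenization of log messages, and the sample log data are hypothetical and serve only to illustrate the flow.

```python
import hashlib


class Bloom:
    """Small bloom filter using two base hashes combined linearly."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k, self.bits = m, k, 0  # a Python int serves as the bit array

    def _positions(self, value: str) -> list[int]:
        d = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(1, self.k + 1)]

    def add(self, value: str) -> None:
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(value))


def search(term: str, master: Bloom, ranges: list) -> list[str]:
    """ranges is a list of (range_filter, messages) pairs."""
    if not master.might_contain(term):
        return []  # definitely absent: no data range is evaluated
    hits = []
    for range_filter, messages in ranges:
        if not range_filter.might_contain(term):
            continue  # this data range is ruled out without reading its messages
        # Possible match: scan the messages to confirm presence.
        hits.extend(msg for msg in messages if term in msg.split())
    return hits


# Hypothetical log data split into two data ranges.
log_ranges = [
    ["login from 10.0.0.1", "logout from 10.0.0.1"],
    ["login from 10.0.0.2"],
]
master = Bloom(m=1 << 16, k=7)
ranges = []
for messages in log_ranges:
    range_filter = Bloom(m=1 << 12, k=7)
    for msg in messages:
        for token in msg.split():
            range_filter.add(token)
            master.add(token)
    ranges.append((range_filter, messages))
```

In this sketch a false positive from either filter only costs an unneeded scan, since the final scan confirms presence against the messages themselves.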
The modules and other elements of the apparatus 100 may be machine readable instructions stored on a non-transitory computer readable medium. In addition, or alternatively, the modules and other elements of the apparatus 100 may be hardware or a combination of machine readable instructions and hardware.
The data range based bloom filter 104 and/or the master bloom filter 114 may report false positives with a predictable probability as discussed above with reference to
The pre-computed hash values 110 for the data range based bloom filters 104 related to the specified data range may be stored adjacent to the log data 108 for the particular data range. This may provide for the application of the same archiving, retention, and storage limits and/or policies to the pre-computed hash values 110 and the log data 108. For example, when the log data 108 falls outside a retention period, the log data 108 and associated pre-computed hash values 110 may be deleted, for example, to avoid unneeded storage of the pre-computed hash values 110. The pre-computed hash values 110 for the master bloom filter 114 may be stored separately from the log data 108. This may provide for application of storage group limits to the pre-computed hash values 110 for the master bloom filter 114.
The data range based bloom filters 104 may also track a number of log messages (or other distinct values) for the log data 108 that are contained in the data ranges associated with the data range based bloom filters 104. The tracked number of log messages may be used to determine a number of the log messages or other events scanned by the query processing module 116 and/or the log message data analysis module 122. Further, the number of log messages that are eliminated by the data range based bloom filters 104 and/or the master bloom filter 114 may also be added to the number of log messages that are actually scanned by the query processing module 116 and/or the log message data analysis module 122 to determine a total amount of the log messages or other events that are subject to the query 118. The total amount of the log messages or other events that are subject to the query 118 may be used to confirm whether all of the appropriate log data 108 has been evaluated. For example, in the event of an error in the evaluation of the log data 108, for example, due to an unexpected event, the number of log messages for a given data range of the log data 108 may be compared to the total number of the log data 108 that has been evaluated by the query processing module 116 and/or the log message data analysis module 122 to confirm that all of the log data in the given data range has been evaluated (i.e., some of the log data 108 has not been inadvertently omitted from evaluation).
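The accounting described above may be sketched as follows, assuming a hypothetical per-range record (`RangeResult`) that holds the tracked message count, whether the range was eliminated by a bloom filter, and how many messages were actually scanned. The names and sample figures are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class RangeResult:
    message_count: int  # count tracked alongside the data range's bloom filter
    eliminated: bool    # True if a bloom filter ruled the data range out
    scanned: int = 0    # messages actually read during evaluation


def total_subject_to_query(results: list) -> int:
    """Messages eliminated by bloom filters plus messages actually scanned."""
    scanned = sum(r.scanned for r in results if not r.eliminated)
    eliminated = sum(r.message_count for r in results if r.eliminated)
    return scanned + eliminated


def coverage_is_complete(results: list) -> bool:
    """Compare the total against the tracked counts to detect omitted log data."""
    expected = sum(r.message_count for r in results)
    return total_subject_to_query(results) == expected


results = [
    RangeResult(message_count=500, eliminated=True),
    RangeResult(message_count=300, eliminated=False, scanned=300),
    RangeResult(message_count=200, eliminated=False, scanned=200),
]
complete = coverage_is_complete(results)
```

If a scan were interrupted so that fewer messages were read than the tracked count for a data range, the comparison in this sketch would fail and indicate that some of the log data was not evaluated.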
The bloom filter specification module 102 may also specify characteristics for scaling a plurality of the data range based bloom filters 104. For such scaled data range based bloom filters 104, the pre-computed hash generation module 106 may generate corresponding pre-computed hash values 110 that are also scaled. The scaled pre-computed hash values 110 may be used by the query processing module 116 in a similar manner as the pre-computed hash values 110 that do not include scaling, except that the scaled pre-computed hash values 110 may be used to evaluate corresponding scaled data range based bloom filters 104 (i.e., data range based bloom filters 104 with similar parameters, such as, bits, as the scaled pre-computed hash values 110).
With respect to scaling of a plurality of the data range based bloom filters 104, when a bloom filter reaches a specified number of elements (e.g., 1000 elements), a further bloom filter that holds, for example, twice that number, or another predetermined number of elements, may be added. Similarly, further bloom filters may be added as needed once existing bloom filters reach a specified number of elements.
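One possible way to realize such scaling is sketched below. The class names `Bloom` and `ScalingBloom`, the ten bits per element, the seven hashes, and the growth factor of two are assumptions chosen for the example, and the sketch counts insertions rather than distinct values.

```python
import hashlib


class Bloom:
    """Fixed-capacity bloom filter that counts its insertions."""

    def __init__(self, capacity: int, bits_per_element: int = 10, num_hashes: int = 7):
        self.capacity = capacity
        self.num_bits = capacity * bits_per_element
        self.num_hashes = num_hashes
        self.bits = 0   # a Python int serves as the bit array
        self.count = 0  # number of insertions, not distinct values

    def _positions(self, value: str) -> list[int]:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(1, self.num_hashes + 1)]

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos
        self.count += 1

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


class ScalingBloom:
    """Adds a larger bloom filter each time the newest one reaches capacity."""

    def __init__(self, initial_capacity: int = 1000, growth_factor: int = 2):
        self.growth_factor = growth_factor
        self.stages = [Bloom(initial_capacity)]

    def add(self, value: str) -> None:
        current = self.stages[-1]
        if current.count >= current.capacity:
            current = Bloom(current.capacity * self.growth_factor)
            self.stages.append(current)
        current.add(value)

    def might_contain(self, value: str) -> bool:
        # A value may be in any stage, so every stage is checked.
        return any(stage.might_contain(value) for stage in self.stages)


scaled = ScalingBloom(initial_capacity=1000, growth_factor=2)
for n in range(2500):
    scaled.add(f"host-{n}")
```

In this sketch the first 1000 insertions fill the initial bloom filter, and the remaining 1500 go into a second bloom filter sized for 2000 elements. A membership check consults both.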
Referring to
At block 904, the method may include receiving log data 108.
At block 906, the method may include pre-computing hash values 110 related to log data information 112 from the log data 108 to generate the data range based bloom filter 104 based on the specified characteristics. According to an example, the data range based bloom filter 104 may correspond to a data range of the log data 108. According to an example, the method may include pre-computing the hash values related to the log data information 112 from the log data 108 to generate a plurality of data range based bloom filters that include the data range based bloom filter based on the specified characteristics. According to an example, the plurality of data range based bloom filters may correspond to a plurality of data ranges that include the data range of the log data 108.
At block 908, the method may include using the pre-computed hash values 110 to generate a master bloom filter 114 for the log data information 112 for a predetermined amount of the log data 108. According to an example, the predetermined amount of the log data 108 may be greater than the data range of the log data 108.
At block 910, the method may include receiving query information 120 to be searched in the log data 108.
At block 912, the method may include computing a hash value related to the query information 120.
At block 914, the method may include comparing the hash value related to the query information 120 to the pre-computed hash values 110 related to the master bloom filter 114 to determine whether the query information 120 is likely to be present in the log data 108 or whether the query information 120 is not present in the log data 108. According to an example, in response to a determination that the query information 120 is likely to be present in the log data 108, the method may include comparing the hash value related to the query information 120 to the pre-computed hash values 110 related to the data range based bloom filter 104 to determine whether the query information 120 is likely to be present in the data range of the log data 108 or whether the query information 120 is not present in the data range of the log data 108. According to an example, in response to a determination that the query information 120 is not present in the log data 108, the method may include stopping further evaluation of the log data 108. According to an example, in response to a determination that the query information 120 is not present in the data range of the log data 108, the method may include stopping further evaluation of the data range of the log data 108. According to an example, in response to a determination that the query information 120 is likely to be present in the data range of the log data 108, the method may include evaluating the log data 108 to confirm presence of the query information 120 in the log data 108.
Referring to
At block 1004, the method may include receiving log data 108.
At block 1006, the method may include pre-computing hash values 110 related to log data information 112 from the log data 108 to generate the data range based bloom filters based on the specified characteristics. According to an example, the data range based bloom filters may correspond to a plurality of data ranges of the log data 108.
At block 1008, the method may include pre-computing further hash values (e.g., further hash values 110) related to the log data information 112 from the log data 108 to generate a master bloom filter 114 for the log data information 112 for a predetermined amount of the log data 108. The predetermined amount of the log data 108 may be greater than a total of the plurality of data ranges of the log data 108.
At block 1010, the method may include receiving query information 120 to be searched in the log data 108.
At block 1012, the method may include computing a hash value related to the query information 120.
At block 1014, the method may include comparing the hash value related to the query information 120 to the pre-computed further hash values 110 related to the master bloom filter 114 to determine whether the query information 120 is likely to be present in the log data 108 or whether the query information 120 is not present in the log data 108. According to an example, in response to a determination that the query information 120 is likely to be present in the log data 108, the method may include comparing the hash value related to the query information 120 to pre-computed hash values 110 related to an appropriate additional data range based bloom filter of the additional data range based bloom filters to determine whether the query information 120 is likely to be present in the data range of the log data 108 corresponding to the appropriate additional data range based bloom filter or whether the query information 120 is not present in the data range of the log data 108 corresponding to the appropriate additional data range based bloom filter.
According to an example, the method may include scaling the data range based bloom filters 104 by adding additional data range based bloom filters once existing data range based bloom filters are filled to a predetermined capacity related to the specified characteristics.
According to an example, the method may include specifying characteristics of a data range based bloom filter 104. The characteristics may include a size of the data range based bloom filter 104 and an acceptable false positive rate associated with the data range based bloom filter 104. The method may include receiving data (e.g., the log data 108, or other data), and pre-computing hash values related to data information (e.g., the log data information 112, or other data information) from the data to generate the data range based bloom filter 104 based on the specified characteristics. The data range based bloom filter 104 may correspond to a data range of the data. The method may include receiving query information 120 to be searched in the data, computing a hash value related to the query information 120, and comparing the hash value related to the query information 120 to the pre-computed hash values related to the data range based bloom filter 104 to determine whether the query information 120 is likely to be present in the data or whether the query information 120 is not present in the data. According to an example, a time for the comparison may be independent of a number of elements in the data range for the data that are to be searched for the query information 120.
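The relationship between the size of a bloom filter and its acceptable false positive rate may be illustrated with the standard bloom filter sizing formulas, which are general results and are not stated in the present disclosure. For n expected elements and a target false positive rate p, the number of bits is approximately m = -n ln(p) / (ln 2)^2 and the number of hashes is approximately k = (m/n) ln 2. The function names below are hypothetical.

```python
import math


def bloom_parameters(expected_elements: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (m, k): bits and hash count for the target false positive rate."""
    m = math.ceil(
        -expected_elements * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    k = max(1, round(m / expected_elements * math.log(2)))
    return m, k


def expected_false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false positive rate after n insertions: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Example: a data range expected to hold 1000 elements at a 1% false positive rate.
m, k = bloom_parameters(expected_elements=1000, false_positive_rate=0.01)
rate = expected_false_positive_rate(m, k, n=1000)
```

Under these formulas, roughly ten bits per element may be sufficient for a false positive rate near one percent. The number of bit positions checked per query depends only on k, which is consistent with a comparison time that is independent of the number of elements in the data range.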
According to an example, the method may include evaluating the data to confirm presence of the query information 120 in the data.
The computer system 1100 may include a processor 1102 that may implement or execute machine readable instructions performing some or all of the methods, functions and other processes described herein. Commands and data from the processor 1102 may be communicated over a communication bus 1104. The computer system may also include a main memory 1106, such as a random access memory (RAM), where the machine readable instructions and data for the processor 1102 may reside during runtime, and a secondary data storage 1108, which may be non-volatile and stores machine readable instructions and data. The memory and data storage are examples of computer readable mediums. The memory 1106 may include a bloom filter based log data analysis module 1120 including machine readable instructions residing in the memory 1106 during runtime and executed by the processor 1102. The bloom filter based log data analysis module 1120 may include the modules of the apparatus 100 shown in
The computer system 1100 may include an I/O device 1110, such as a keyboard, a mouse, a display, etc. The computer system may include a network interface 1112 for connecting to a network. Other known electronic components may be added or substituted in the computer system.
What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2014/012103 | 1/17/2014 | WO | 00 |