The present disclosure generally relates to data processing methods.
Since the invention of the computer, the processing power of computer systems has continued to improve, more or less following the famous Moore's Law. With ever-increasing computing power, more and more data-intensive applications are being developed. It is now not uncommon to see a database that stores many billions of records, running into petabytes of data storage. Often, it is desirable to quickly analyze the data to obtain interesting information.
The present disclosure generally relates to data processing. Bloom filters are used to process data at high speed. A Bloom filter that is initialized based on a source string can be used to quickly determine the similarity between the source string and a query string.
It is often useful to determine whether a query string, or a segment of the query string, is present in a source file. For example, there are many real world applications in determining whether a specific pattern exists in an input file, such as classification based on pattern matching, searching, and others. To make this kind of determination, it is typically necessary to search through the entire source file for the input string, which is computationally expensive and slow. For example, matching strings of information often involves template matching and matching time-series embedding generated by non-linear dynamical systems. The time-series is typically converted into a frequency domain via Fourier methods, and the power spectra of the source time-series and query time-series are compared via likelihood ratio methods/density methods and the Corr-entropy measures. These methods are often not fast enough for a variety of applications.
Similarity determinations often involve template matching and matching of time-series embeddings generated by non-linear dynamical systems. The time-series is typically converted into the frequency domain via Fourier methods, and the power spectra of the source time-series and query time-series are compared via likelihood ratio methods/density methods and the corr-entropy measures. These methods are often slow and computationally expensive. Thus, it is desirable to have methods and systems for quickly determining whether a source file contains a substring.
The present disclosure describes techniques for determining whether a source file contains a substring using Bloom filters. Bloom filters are generated by processing the source file using hash functions. To determine whether a source file contains a substring, a Bloom filter based on the source file (which can be much smaller than the source file in size) is used, thereby allowing for quick determination. It is to be appreciated that in many applications, where the speed at which an answer is provided is more important than 100% certainty, techniques described in the present disclosure can provide advantages in computational speed and cost.
As an example, a source string of binary values of a given length is compared with a query string (of possibly unequal length) to determine whether or not the two strings are similar. Furthermore, it is possible to determine whether a substring in the source string is aligned with, or matched to, the query string. The process is lightweight and can be implemented online in real time.
Bloom filters were conceived and developed by Burton Howard Bloom. A Bloom filter is a data structure that can be used to test whether an element is a member of a set. Often used for making approximations, Bloom filters sometimes return false positives, but never false negatives. The accuracy of a given Bloom filter depends on various factors, which can be adjusted according to the needs of particular applications.
To use a Bloom filter, the Bloom filter first needs to be initialized. For example, an empty Bloom filter having m bits (all set to 0 initially) is initialized using k hash functions that map information into the Bloom filter. Once the Bloom filter is initialized based on a source file, the Bloom filter can be used to determine whether a query string exists in the source file.
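By way of illustration only, an m-bit Bloom filter addressed by k hash functions may be sketched in Python as follows. The class name BloomFilter, and the choice of deriving the k hash functions by salting SHA-256 with an index, are assumptions made for this sketch and are not taken from the present disclosure.

```python
import hashlib


class BloomFilter:
    """Bit array of m bits addressed by k hash functions (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Assumption: the k hash functions are SHA-256 salted with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the k bits selected by the hash functions.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # An item may be present only if all k of its bits are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter(m=1024, k=3)
bf.add("0101")
print("0101" in bf)  # True: an added item is never reported as absent
```

A membership test on an item that was never added usually returns False, but may return True when all k of its bits happen to have been set by other items.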
As an example, a use case involves two bit strings of arbitrary length, one being a source string and the other being a query string. The goal is to produce an output that is a decision informing the user whether there is a substring (of a given size) of the query string in the source string. For example, the source string is as follows:
Source string: 010100001110101010000011110000
And the query string is as follows:
Query string: 1000110000111101010101
Similarities between the source string and the query string are to be determined. For example, a measurement of similarity can be based on whether there is a substring of a given size of the query string in the source string. The size of the substring is referred to as “window size.” For example, for a window size of 3 bits, the source string and query string are similar, as both strings contain a substring of “100”. On the other hand, for a window size equal to the length of the query string, the source string and the query string are not similar, as the query string as a whole does not occur in the source string. Depending on the application, different window sizes can be selected.
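The window-based notion of similarity can be stated exactly, without any Bloom filter, as in the following Python sketch. The function name share_window is hypothetical. The sketch serves only as a reference definition against which an approximate, filter-based answer can be compared.

```python
def share_window(source: str, query: str, window_size: int) -> bool:
    """Exact check: does any window_size-long substring of query occur in source?"""
    if window_size <= 0 or window_size > min(len(source), len(query)):
        return False
    # Collect every window of the source once, then probe with query windows.
    source_windows = {
        source[j:j + window_size]
        for j in range(len(source) - window_size + 1)
    }
    return any(
        query[j:j + window_size] in source_windows
        for j in range(len(query) - window_size + 1)
    )


source = "010100001110101010000011110000"
query = "1000110000111101010101"

print(share_window(source, query, 3))           # True: "100" occurs in both
print(share_window(source, query, len(query)))  # False: the whole query is absent
```

The set of source windows in this sketch grows with the source string, whereas a Bloom filter replaces it with a fixed-size bit array at the cost of occasional false positives.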
It is to be appreciated that using two substrings, substrings A and B, improves accuracy of the Bloom filter. Due to possible false positives of Bloom filters, errors are possible in characterizing membership of a query string using only the substring A. The probability of a false positive is reduced by also checking membership of the substring B. To improve accuracy further, additional substrings that are segments of the substring A can be used for checking membership.
The Bloom filter initialized in
And the query string consists of the following characters:
The source string and the query string may be deemed similar for a WindowSize of 10. That is, there is a 10-character segment of the query string that is the same as a corresponding 10-character segment of the source string.
The accuracy and computational speed for determining similarity depend on the parameters used. For example, by choosing the parameter WindowSize to be close to 30%-50% of the length of the query string, the number of comparisons to be performed is greatly reduced compared to comparing the entire query string. A way to make the determination is to find whether there is a substring in the source string that is common to a segment of the query string. The length of the substring segment is defined as WindowSize. Using the techniques described above, the determination can be made in time that is sub-quadratic in the length of the source string A.
As described above, the accuracy and speed of similarity determinations depend on various parameters, which can be set by the user. To provide a Bloom filter, the parameter WindowSize (i.e., substring size) is selected, and substrings of size WindowSize in the source string are processed by k hash functions. The process can be performed by traversing the source string once and extracting substring (j, windowSize). For example, the following pseudo code is used:
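By way of illustration only, the single traversal described above may be sketched in Python as follows. The function names (bit_positions, init_filter, maybe_contains, is_similar), the salted SHA-256 hashing, and the parameter values m=4096 and k=4 are assumptions chosen for this sketch and are not taken from the present disclosure.

```python
import hashlib


def bit_positions(item: str, m: int, k: int) -> list[int]:
    # Assumption: k hash functions obtained by salting SHA-256 with an index.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


def init_filter(source: str, window_size: int, m: int, k: int) -> bytearray:
    """Traverse the source once and hash every substring (j, window_size)."""
    bits = bytearray((m + 7) // 8)
    for j in range(len(source) - window_size + 1):
        for pos in bit_positions(source[j:j + window_size], m, k):
            bits[pos // 8] |= 1 << (pos % 8)
    return bits


def maybe_contains(bits: bytearray, item: str, m: int, k: int) -> bool:
    # True means "possibly present"; False means "definitely absent".
    return all(bits[p // 8] & (1 << (p % 8)) for p in bit_positions(item, m, k))


def is_similar(bits: bytearray, query: str, window_size: int, m: int, k: int) -> bool:
    """Probe the filter with every window of the query."""
    return any(
        maybe_contains(bits, query[j:j + window_size], m, k)
        for j in range(len(query) - window_size + 1)
    )


source = "010100001110101010000011110000"
query = "1000110000111101010101"
bits = init_filter(source, window_size=3, m=4096, k=4)
print(is_similar(bits, query, window_size=3, m=4096, k=4))  # True: "100" is shared
```

Because a Bloom filter has no false negatives, a window that is truly shared is always reported. A reported match may still be a false positive and can be confirmed against the source string where certainty is required.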
For example, once the Bloom filter is initialized, it can be used to determine whether a query substring of size WindowSize is a member of the Bloom filter. As explained above, it is possible to have false positive determinations using Bloom filters. To reduce false positives, the following substrings can also be hashed:
1. hash A.substr(j, WindowSize-STEPSIZE)
2. hash A.substr(j, WindowSize-2*STEPSIZE)
3. hash A.substr(j, WindowSize-3*STEPSIZE)
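By way of illustration only, hashing the shorter substrings listed above in addition to the full window may be sketched in Python as follows. The values of WINDOW_SIZE, STEP_SIZE, M and K, the function names, and the salted SHA-256 hashing are assumptions made for this sketch. For brevity, a Python set of bit indices stands in for the bit array.

```python
import hashlib

M, K = 1 << 16, 4                # filter size in bits and number of hashes (assumed)
WINDOW_SIZE, STEP_SIZE = 10, 2   # assumed values for WindowSize and STEPSIZE
LEVELS = 3                       # three shorter substrings, as in the list above


def positions(item: str) -> list[int]:
    # Assumption: K hash functions obtained by salting SHA-256 with an index.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def lengths() -> list[int]:
    # WindowSize, WindowSize-STEPSIZE, WindowSize-2*STEPSIZE, WindowSize-3*STEPSIZE
    return [WINDOW_SIZE - i * STEP_SIZE for i in range(LEVELS + 1)]


def build(source: str) -> set[int]:
    bits: set[int] = set()  # indices of set bits, standing in for the bit array
    for j in range(len(source) - WINDOW_SIZE + 1):
        for n in lengths():
            bits.update(positions(source[j:j + n]))
    return bits


def window_matches(bits: set[int], window: str) -> bool:
    # Accept a window only if it and all of its shorter prefixes are members.
    return all(p in bits for n in lengths() for p in positions(window[:n]))


def is_similar(bits: set[int], query: str) -> bool:
    return any(
        window_matches(bits, query[j:j + WINDOW_SIZE])
        for j in range(len(query) - WINDOW_SIZE + 1)
    )


source = "010100001110101010000011110000"
query = "11" + source[5:15] + "00"   # constructed to share a 10-character window
print(is_similar(build(source), query))  # True
```

A false positive now requires the full window and every shorter prefix to collide at the same time, which is less likely than a collision on the full window alone. A truly shared window is still always accepted, since its prefixes were hashed when the filter was built.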
For example, in
Similarly, the sizes of substrings C and D in
It is to be appreciated that various processes described above can be performed very quickly. For example, assume that LA≈LB (i.e., the lengths of the strings A and B are almost equal) and that WindowSize is more than 50% of the length of the query string. Under those conditions, the running time is O(LB). The running time is sub-quadratic even otherwise, since it is no longer necessary to find all common substrings.
In an example, a 128 MB size Bloom filter is initialized with 8 hash functions (i.e., k=8). A source string of size 272664 with window size of 1000 bits is hashed. To test the speed for calculation, a query string of size 2403 bits is tested to see if there is a substring of 1000 bits common to both. The process was completed within twenty seconds, which is faster than other methods.
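For a filter of this size, the expected false positive rate can be estimated with the standard approximation (1 - e^(-kn/m))^k, where n is the number of hashed items. The following Python sketch assumes that 128 MB means 128 x 2^20 bytes, that only the full-length windows are hashed (one per position of the source string), and that the hash functions behave ideally. None of these assumptions are stated in the present disclosure.

```python
import math


def false_positive_rate(m_bits: int, k: int, n_items: int) -> float:
    """Standard Bloom filter estimate: (1 - exp(-k*n/m)) ** k."""
    return (1.0 - math.exp(-k * n_items / m_bits)) ** k


m_bits = 128 * 2**20 * 8          # 128 MB expressed in bits (assumed binary megabytes)
n_windows = 272664 - 1000 + 1     # one 1000-bit window per source position
p = false_positive_rate(m_bits, 8, n_windows)
print(n_windows)                  # 271665
print(f"{p:.1e}")                 # a value on the order of 1e-22
```

Under these assumptions the filter is very sparsely populated, so a false positive on any single window query is extremely unlikely.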
In another example, a negative test is performed. A 128 MB bitmap for the Bloom filter is allocated, and a source string of size 272664 with window size of 1000 bits is used. The query string has a size of 272664 bits and is initialized to all zeros. In this example, it took 14-15 seconds to answer the question for a window size of 272000 bits. The time needed to perform the calculation depends on the window size. For example, when the window size is 265000 bits, the similarity test took 227 seconds. When the window size is changed to 269000, it took 77 seconds for the similarity test, and 73 seconds were needed to hash into the Bloom filter.
As can be seen from the above, hashing the strings and substrings is an important process, and it often takes a lot of computational power. The lengths of strings can be reduced to improve performance, as the amount of computation needed for processing the strings decreases as the strings become shorter. Since a signal string is typically a bit string of 0's and 1's, a compression scheme can be used to encode the signal. In this compression scheme, each run of identical symbols is compressed to a frequency count followed by the symbol (0 or 1). For example, the string 111100000011100 is encoded as 41603120 (i.e., four 1's, six 0's, three 1's, and two 0's). When the signal is traversed, the first symbol is dropped and a new symbol (0 or 1) is added at the end, as illustrated in the example below:
[1]11100000011100→11100000011100→11100000011100[0];
And under the compressed scheme, the encoding “41603120” becomes the encoding “31603130”. For example, a subroutine is used to perform the conversion, and the compressed string is fed into hashing routines.
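By way of illustration only, the run-length encoding and the sliding update may be sketched in Python as follows. The function names to_runs, render and slide are hypothetical. The runs are kept as (count, symbol) pairs so that the update touches only the first and last run, and the rendered string is what would be fed into the hashing routines.

```python
from collections import deque
from itertools import groupby


def to_runs(bits: str) -> deque:
    """Split a bit string into [count, symbol] runs."""
    return deque([len(list(run)), symbol] for symbol, run in groupby(bits))


def render(runs: deque) -> str:
    """Concatenate each run as its count followed by its symbol."""
    return "".join(f"{count}{symbol}" for count, symbol in runs)


def slide(runs: deque, new_symbol: str) -> None:
    """Drop the first symbol and append new_symbol, updating runs in place."""
    runs[0][0] -= 1
    if runs[0][0] == 0:
        runs.popleft()
    if runs and runs[-1][1] == new_symbol:
        runs[-1][0] += 1
    else:
        runs.append([1, new_symbol])


runs = to_runs("111100000011100")
print(render(runs))   # 41603120
slide(runs, "0")
print(render(runs))   # 31603130
```

Because counts and symbols are both written as digits, the rendered string can be ambiguous once a run reaches ten or more symbols. A practical implementation may insert a separator between runs before hashing.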
To provide a comparison, with 128 MB bitmap for the Bloom filter and source string of size 272664 with window size of 269000, it took 77 seconds for the similarity test, and 73 seconds were needed to hash into the Bloom filter. With compressed signals, for window size of 250000, it took 57 seconds for the negative test; for window size 269000, the elapsed time is 10 seconds.
It is to be appreciated that the methods and processes described above can be implemented using various types of computing systems, and the algorithm can be stored on various types of computer readable media.
The methods described in the present disclosure can be used for various applications. For example, by determining the similarity between a source string and a query string, it is possible to quickly determine whether the query string should be classified in the same category as the source string. A query string may be in the form of a signal string. A Bloom filter can be used to determine whether a section (or the entirety) of the signal string is similar to a section of a source string. The use of a Bloom filter allows for quick determination at relatively high certainty. In addition, since the query string is compared to the Bloom filter instead of the source string or source file itself, the amount of memory access to the source file is reduced. These techniques can be applied in different domains, such as real-time streaming data, text mining, healthcare applications, and many others.
While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims. It should be understood that the description recited above is an example of the invention and that modifications and changes to the examples may be undertaken which are within the scope of the claimed invention. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements, including a full scope of equivalents.