This application claims the priority benefit of Japanese Patent Application No. 2010-155333, filed on Jul. 8, 2010, the entire descriptions of which are incorporated herein by reference.
The present invention relates to a method and apparatus for calculating from digital sequences feature quantities that take similar values among similar digital sequences, such as electronic files.
There has been great demand for the capability to find similar files quickly and with high precision, as when removing redundant or overlapping data in storages or searching for similar files in PCs and servers. As one method for calculating feature quantities of electronic files and the like used in such search operations, “fuzzy hashing” (also called “similarity hashing”) has been known.
The fuzzy hashing is characterized in that (1) it allows for similarity check among electronic files and (2) produced sizes of hash values are small and fixed. That is, (1) unlike ordinary hash functions which, when there is a change of even one bit to the content of a file, result in a significant change to a hash value, the fuzzy hashing produces a hash value that depends on a degree of change made to the file; and (2) it produces hash values of a fixed length, which is smaller than index information generated by common search engines.
Some known examples of conventional techniques associated with the fuzzy hashing include Patent Literature 1 and Non Patent Literature 1. Both of these methods determine a fuzzy hash by dividing a digital sequence such as an electronic file, applying an ordinary hash function to each of the divided pieces of data to calculate a hash value, and linking together the hash values obtained. With a fuzzy hash determined in this way, even if a part of a file is changed, the fuzzy hash will not change significantly because the hash values of the other unaltered, divided pieces of data remain unchanged. As a result, the fuzzy hashes of similar
Patent Literature 1: U.S. Pat. No. 7,272,602
Non Patent Literature 1: Jesse Kornblum: “Identifying almost identical files using context triggered piecewise hashing”, Digital Investigation 3S (2006) pp. 91-97.
The conventional techniques described in Patent Literature 1 and Non Patent Literature 1 both calculate a fuzzy hash in the following manner.
(Step 1) A digital sequence is scanned from its starting end one byte at a time and a predetermined operation is performed on scanned data strings near a current scanning point to calculate a value. This operation is carried out for each scanning point.
(Step 2) When a calculated value corresponding to a given scanning point exceeds a predetermined threshold, that scanning point is taken as a dividing point at which to divide the digital sequence.
(Step 3) When the scan has reached the tail end of the sequence, the number of divided pieces of data separated from one another (hereinafter referred to as the number of partitions) by the dividing points, is counted. To ensure that fuzzy hashes have a fixed length, the number of partitions must be close to a predetermined fixed value (hereinafter referred to as an output partition number). If the number of partitions is remote from the output partition number, the fuzzy hash calculation process adjusts the threshold before returning to step 2. If not, the process proceeds to step 4.
(Step 4) When a desired partition number is obtained, the process divides the digital sequence at these dividing points and calculates a hash value for each partition or divided pieces of data (hereinafter referred to as a “partition hash” to distinguish it from a fuzzy hash). The partition hashes thus obtained are linked together to produce a fuzzy hash.
That is, with the conventional technique it is necessary to adjust the threshold so that the partition number comes close to the output partition number. The reason that the file is not divided simply at equal intervals of a predetermined fixed length is that, if the digital sequence in a certain partition or divided piece of data is expanded even by 1 bit, as a result of editing or modification, the positions of dividing points in the sequence following that partition shift, resulting in a loss of match in divided position between the original sequence before the modification and the modified one, which in turn will cause the value of the fuzzy hash to change significantly.
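The effect of fixed-interval division can be illustrated with a minimal sketch. The function name fixed_chunks and the sample data are hypothetical and serve only as an illustration; a one-byte insertion is used here in place of a one-bit expansion for simplicity.

```python
def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Divide data at equal intervals of `size` bytes (the naive approach)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


original = b"ABCDEFGHIJKLMNOPQRSTUVWX"
# Insert a single byte after the fifth byte of the original sequence.
edited = original[:5] + b"!" + original[5:]

chunks_before = fixed_chunks(original, 4)
chunks_after = fixed_chunks(edited, 4)

# Only the chunk located entirely before the insertion survives unchanged;
# every later chunk boundary has shifted by one byte.
shared = set(chunks_before) & set(chunks_after)
print(shared)
```

In this sketch only the first piece is common to both divisions, although the two sequences differ by a single byte.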
However, when a fuzzy hash is calculated by the method described in Patent Literature 1 and Non Patent Literature 1 for a digital sequence that has been expanded to some extent by editing, the resulting threshold is highly likely to differ from the one used before editing, because the conventional method adjusts the threshold to make the partition number approximate the output partition number. Once the threshold is changed, the way in which the file is divided becomes drastically different, with the result that the fuzzy hash thus produced will no longer be near the value of the fuzzy hash of the original file.
In summary, the conventional technique has a problem that if a threshold is altered as a result of file expansion, a digital sequence similarity check can no longer be made correctly using fuzzy hashes.
To solve the aforementioned problem of the conventional techniques associated with a change in threshold, this specification discloses a method which, rather than adjusting the number of partitions by changing the threshold, divides a digital sequence with a variety of different thresholds to produce a set of partition hashes and outputs them in a number not exceeding the output partition number as a fuzzy hash. Since the fuzzy hash thus produced includes partition hashes of data pieces divided by a variety of different thresholds, even if the threshold is changed as a result of file modifications, as long as the changed threshold is included in a set of thresholds of the original file before modification, the two fuzzy hashes will not take drastically different values.
To describe in more detail, dividing points are determined by a threshold that produces the least number of partitions (the threshold is hereinafter referred to as a “level” which will be defined by referring to
With the conventional techniques described in Patent Literature 1 and Non Patent Literature 1, since only a set of partition hashes belonging to the lowest level is output as a fuzzy hash, if a file is modified resulting in a set of partition hashes at its lowest level being changed, a correct distance between two fuzzy hashes cannot be calculated.
To deal with this problem, the method of this invention first compares two sets of fuzzy hash levels and calculates a distance between two sets of partition hashes belonging to the lowest of common levels. Unlike the conventional techniques, this method compares the fuzzy hashes at the same level and therefore can correctly calculate the distance between them. This method is disclosed as a second aspect of this invention.
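The comparison at the lowest common level may be sketched as follows. The representation of a fuzzy hash as a mapping from level to a set of partition hashes, the function names, and the use of a Jaccard distance are all assumptions made for illustration; the specification does not fix a particular distance measure at this point.

```python
def lowest_common_level(fh_a: dict, fh_b: dict):
    """Return the lowest level present in both fuzzy hashes, or None."""
    common = set(fh_a) & set(fh_b)
    return min(common) if common else None


def fuzzy_distance(fh_a: dict, fh_b: dict) -> float:
    """Jaccard distance (an assumed measure) at the lowest common level."""
    level = lowest_common_level(fh_a, fh_b)
    if level is None:
        return 1.0  # no level in common: treat as maximally distant
    a, b = set(fh_a[level]), set(fh_b[level])
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical fuzzy hashes: level -> set of partition hashes.
fh_original = {8: {"h1"}, 7: {"h1", "h2"}, 6: {"h1", "h2", "h3", "h4"}}
fh_modified = {8: {"h1"}, 7: {"h1", "h5"}}
print(lowest_common_level(fh_original, fh_modified))
print(fuzzy_distance(fh_original, fh_modified))
```

Here level 6 exists only in the first fuzzy hash, so the comparison is made at level 7, the lowest level the two have in common.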
Finding common partition hashes at each level generally requires many computations. Therefore, taking advantage of the fact that the higher the level, the lower the likelihood of a dividing point occurring, this invention finds matching portions starting from the highest level, where the number of partitions is smallest, and moving one level down at a time, so as to minimize the number of partition hashes to be compared and thereby reduce the calculation volume. This method is disclosed as a third aspect of this invention.
Further, the conventional techniques have introduced a concept of a threshold to keep constant the output sizes of fuzzy hashes regardless of file size. The requirement of keeping the size of fuzzy hashes to a fixed length is in itself a restraint intended to avoid fuzzy hashes imposing onerous burden on the storage capacity. Thus there may be cases where this requirement may be excluded, as when the storage capacity is sufficiently larger than files to be stored. In that case, the size of feature quantity can be increased in proportion to the file size and, because of increased volume of information on the feature quantity, the similarity check accuracy can be expected to improve.
Therefore, the method of calculating a fuzzy hash whose output size depends on a file size, a similar file search method and an apparatus to implement these methods are disclosed as a fourth aspect of this invention that solves the problem with the conventional techniques.
The above aspect allows fuzzy hashes of even those files, for which similarity judgment cannot be made by the conventional techniques described in Patent Literature 1 and Non Patent Literature 1, to assume close values, raising the possibility of similarity judgment being made correctly. In more detail, this aspect makes it possible to search similar files in PCs and servers more precisely than the conventional techniques. Further, this aspect also enables redundant or overlapping portions in a file in a storage to be found more reliably. Erasing the overlapping or redundant portions before storing can reduce the storage capacity required more than can the conventional techniques.
This invention raises the possibility that a similarity judgment can be made correctly of even those files for which similarity judgment cannot be made by the conventional techniques. Other objects, features and advantages of this invention will become more apparent from the following descriptions taken in conjunction with the accompanying drawings.
Embodiments of the present invention will be described by referring to accompanying drawings.
The digital sequence feature amount calculation apparatus 10 is configured to have a storage 100 in which to store digital sequences such as electronic files and programs, a CPU 120 to perform a variety of computations, a memory 140 in which to temporarily store data for computation, and an input/output interface 160 for user dialog devices such as keyboard, mouse and display, all connected to an internal signal line 180 or hub. The storage 100 includes storage media such as hard drives, flash memories and RAIDs.
The digital sequence feature amount calculation apparatus 10 has, as in PCs and servers, a CPU 120 and a memory 140 and may be mounted as one function that runs on PCs and servers.
The storage 100 has processing units, such as a file storage unit 102, a fuzzy hash storage unit 104, a distance storage unit 106, a control unit 110, a fuzzy hash calculation unit 122, a distance calculation unit 124 and a file search unit 126.
In the digital sequence feature amount calculation apparatus 10, the file storage unit 102 stores electronic files, on which the user can perform operations, in low-level blocks that are managed by a block IO. In descriptions that follow, byte strings in electronic files stored in the file storage unit 102, together with a concept of blocks, are referred to as a “digital sequence.” Unless otherwise specifically noted, a word “electronic file” also implies a digital sequence.
The fuzzy hash calculation unit 122 calculates a fuzzy hash for an electronic file stored in the file storage unit 102. The calculated fuzzy hash is stored in the fuzzy hash storage unit 104. The distance calculation unit 124 calculates a similarity (distance) between files by using fuzzy hashes stored in the fuzzy hash storage unit 104 and stores it in the distance storage unit 106. The file search unit 126 looks for similar files by using distance information stored in the distance storage unit 106.
To ensure fast computation on similar files, the method of this embodiment calculates distances between fuzzy hashes in advance and stores them in the distance storage unit 106. The control unit 110 sends files stored in the file storage unit 102 successively to the fuzzy hash calculation unit 122 and also forwards the fuzzy hashes stored in the fuzzy hash storage unit 104 one after another to the distance calculation unit 124, thereby determining distances for all combinations of files and updating the distance storage unit 106.
The processing units 110, 122, 124, 126 in the storage 100 are implemented by the CPU 120 executing programs stored in the memory 140. The programs may be stored in the memory 140 beforehand or loaded into the memory 140 from other devices through the input/output interface 160 and media that can be used by the computer. The media, for example, refer to removable storage media that can be connected to or disconnected from the input/output interface, or communications media (e.g. wired, wireless or optical networks, or carrier waves and digital signals propagating on the networks).
The programs implementing these processing units 110, 122, 124, 126 may be stored in a read-only memory (ROM) not shown, rather than in the rewritable storage 100.
Now, referring to
The fuzzy hash calculation unit 122 has a file read unit 202 to read an electronic file from the file storage unit 102, a normalization unit 204 to eliminate information not necessary for the fuzzy hash calculation from the read file, a data dividing unit 206 to divide the normalized file, a partition hash calculation unit 208 to calculate a hash value for each of the divided data pieces and a fuzzy hash output unit 210 to output a set of the partition hashes obtained. The fuzzy hash calculation unit 122 also has an initial setting unit 200 to make settings such as parameters associated with the processing units 202, 204, 206, 208, 210.
In the calculation of fuzzy hashes, the processing units 200, 202, 204, 206, 208, 210 temporarily store data in the memory 140 for its checking, editing or removal.
The fuzzy hashes produced by the fuzzy hash calculation unit 122 are stored in the fuzzy hash storage unit 104 by the fuzzy hash output unit 210. Alternatively they may be presented to the user on a display through the input/output interface 160.
Before proceeding to give detailed explanation on operation of devices of
In this embodiment and the conventional technique, a fuzzy hash for a digital sequence 30 such as an electronic file is produced by scanning the digital sequence 30 from the starting point one byte at a time to extract a data string 302 of K bytes beginning at a scan point 300. K is a small value, e.g. 7 in Non Patent Literature 1. The same value may also be taken in this embodiment.
Next, the data string 302 is fed into a hash function 32 to calculate a hash value 34. Patent Literature 1 and Non Patent Literature 1 adopt a fast hash function 32 called a “rolling hash.”
The reason why the rolling hash is employed as the hash function 32, rather than a method that, for example, simply adds up bytes in the data string 302, is that the latter simple method depends greatly on how bytes of the digital sequence 30 are arranged, giving rise to a possibility of similar hash values 34 recurring one after another. Since the dividing point is determined according to the hash value 34, as described later, if similar hash values recur successively, there is likely to be a bias to the arrangement of dividing points, i.e., the manner in which the sequence is divided. Because the fuzzy hash is produced by determining a partition hash for each of divided pieces of data and linking together the partition hashes, if the digital sequence is changed in only one portion and if the dividing points happen to concentrate in that portion, it will have a significant effect on the fuzzy hash. To get around this problem, the hash function 32 is used to divide the digital sequence at as equal an interval as possible. As described in a literature cited below, the rolling hash is known to be a function capable of hashing the values of a digital sequence at high speed. This is why the rolling hash is adopted by Patent Literature 1 and Non Patent Literature 1.
Richard M. Karp and Michael O. Rabin: “Pattern-matching algorithms”, IBM Journal of Research and Development, 31(2) pp. 249-260, 1987.
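As an illustration of how a hash over a sliding K-byte window can be updated at high speed, a minimal sketch of a generic polynomial rolling hash follows. This is not the specific rolling hash of Patent Literature 1 or Non Patent Literature 1; the multiplier BASE, the 32-bit mask, and the function names are assumptions chosen for illustration.

```python
from collections.abc import Iterator

K = 7              # window size in bytes (the value cited for Non Patent Literature 1)
BASE = 257         # assumed multiplier, for illustration only
MASK = 0xFFFFFFFF  # keep the hash value within 32 bits


def window_hash(window: bytes) -> int:
    """Hash one window from scratch (used as a reference)."""
    h = 0
    for b in window:
        h = (h * BASE + b) & MASK
    return h


def rolling_hashes(data: bytes, k: int = K) -> Iterator[tuple[int, int]]:
    """Yield (scan_point, hash_value) for every k-byte window of data."""
    if len(data) < k:
        return
    top = pow(BASE, k - 1, MASK + 1)  # weight of the byte leaving the window
    h = window_hash(data[:k])
    yield 0, h
    for p in range(1, len(data) - k + 1):
        h = (h - data[p - 1] * top) & MASK       # remove the outgoing byte
        h = (h * BASE + data[p + k - 1]) & MASK  # append the incoming byte
        yield p, h
```

Each step after the first costs a constant number of operations regardless of K, which is the property that makes a rolling hash attractive for byte-by-byte scanning.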
In the conventional method, a hash value 34 is calculated for each scan point 300 and a threshold is adjusted so that a predetermined number of partitions can be extracted using the hash values. To describe details of the method mentioned in Non Patent Literature 1, t least significant bits or endmost bits 340 are extracted from the hash value 34 of t_max bit and, if these extracted bits are all zeros, the scan point 300 is regarded as a dividing point. Here t_max refers to the number of bits required to represent a maximum possible value that the hash value 34 can take. The rolling hash in Non Patent Literature 1 produces a 32-bit hash value 34, so t_max is 32.
Suppose the hash function 32 can completely randomize a digital sequence 30 so that the probability of occurrence of the hash value 34 is uniform. Since the probability of all of the t endmost bits of the hash value 34 becoming zeros is 1/2^t, the value of t can then be determined by the following equation.
(length of digital sequence 30) × (1/2^t) = (output partition number) − 1.
In practice, however, the original digital sequence 30 can be randomized only to some extent by the hash function 32, so the value of t often differs from the value calculated from the above equation. The technique shown in Non Patent Literature 1 therefore changes t until the number of partitions almost matches the output partition number.
Patent Literature 1 also divides a digital sequence 30 in almost the same way. In the following description, t is called a “level”.
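The dividing point test and the estimate of t from the above equation can be sketched as follows. The function names are hypothetical, and rounding the result to the nearest integer is an assumption; as noted above, the conventional technique then adjusts t further in practice.

```python
import math


def is_dividing_point(hash_value: int, t: int) -> bool:
    """True when the t least significant bits of hash_value are all zeros."""
    return (hash_value & ((1 << t) - 1)) == 0


def estimate_level(sequence_length: int, output_partition_number: int) -> int:
    """Solve length * (1 / 2**t) = output_partition_number - 1 for t."""
    if output_partition_number < 2:
        raise ValueError("output_partition_number must be at least 2")
    t = math.log2(sequence_length / (output_partition_number - 1))
    return max(0, round(t))  # rounding is an assumption of this sketch


print(is_dividing_point(0b101000, 3))
print(is_dividing_point(0b101000, 4))
print(estimate_level(65536, 65))
```

For a 65536-byte sequence and an output partition number of 65, the equation gives 65536 / 2^t = 64, that is, t = 10.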
As described above, the conventional method adjusts the level so that the number of partitions matches the output partition number. So, if the level changes as a result of modification of a file, a fuzzy hash will become drastically different, giving rise to a problem that the produced fuzzy hash is unable to be used for similarity check. In dealing with this problem, it is an aspect of this embodiment to produce partition hashes for as many levels as possible.
More specifically, the first step is to set the level to t_max and determine dividing points. A point at which all bits of the hash value 34 are zeros is taken as a dividing point. The probability of such a hash value being produced is low and therefore the number of resultant dividing points is also small. In the example of
Next, the level is lowered by one to t_max−1 and a similar step is taken to determine further dividing points. It is noted here that the point 300, which has been picked up as the dividing point for the level t_max, is also selected as a dividing point for the lower level t_max−1. This follows from the definition of the level: if all of the t endmost bits of the hash value are zeros, the scan point in question is taken as a dividing point, and a hash value whose t_max endmost bits are all zeros necessarily has its t_max−1 endmost bits all zeros as well. In the example of
The similar operation is repeated until the total number of partitions for all levels reaches the output partition number. For each of the divided pieces of data thus obtained, a partition hash is calculated to output a fuzzy hash 36. This is a first aspect of this embodiment.
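The level-by-level determination of dividing points may be sketched as follows. The function names and the sample hash values are hypothetical. Two points are assumptions of this sketch rather than statements of the specification: the total is taken as the sum, over the levels processed so far, of the number of partitions at each level, and the descent stops at level 1.

```python
def trailing_zero_bits(h: int, t_max: int) -> int:
    """Number of trailing zero bits of h, capped at t_max (h == 0 gives t_max)."""
    if h == 0:
        return t_max
    return min(t_max, (h & -h).bit_length() - 1)


def dividing_points_by_level(hashes, t_max: int, output_partition_number: int) -> dict:
    """Collect dividing points from level t_max downward.

    hashes: list of (scan_point, hash_value) pairs, one per scan point.
    Returns a mapping level -> list of dividing points at that level.
    """
    levels = {}
    total_partitions = 0
    t = t_max
    while t >= 1 and total_partitions < output_partition_number:
        points = [p for p, h in hashes if trailing_zero_bits(h, t_max) >= t]
        levels[t] = points
        total_partitions += len(points) + 1  # n dividing points give n + 1 pieces
        t -= 1
    return levels


# Hypothetical 4-bit hash values at five scan points.
sample = [(0, 0b1000), (1, 0b0100), (2, 0b0001), (3, 0b0000), (4, 0b0010)]
levels = dividing_points_by_level(sample, t_max=4, output_partition_number=6)
print(levels)
```

In the sample, every dividing point found at a given level reappears at each lower level, which is the nesting property described above.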
The fuzzy hash under consideration includes various levels of divided pieces of data, so that even if the level changes as a result of a file modification, as long as the divided data pieces of interest are included in a set of data pieces at a level prior to the file modification, the fuzzy hashes being compared do not assume totally different values. This embodiment therefore can be said to excel in similarity check accuracy, when compared to the conventional method which, in the event of a level change, may result in the fuzzy hash being unable to function correctly. More detailed description of the method of outputting and storing fuzzy hashes 36 will be given by referring to
Further, by taking advantage of the characteristic fact that dividing points at one level always become dividing points at lower levels, the similarity check using fuzzy hashes can be speeded up. The similarity checks utilizing this feature are second and third aspects of this embodiment, which will be described later referring to
Now that the difference between this embodiment and the conventional technique has been clarified, a fuzzy hash calculation flow of
(Step 400) The initial setting unit 200 sets parameters for processing units 202, 204, 206, 208, 210, for example, an output partition number. Further, as described later by referring to
Further, the initial setting unit 200 sets miscellaneous parameters such as K in
The user can set the aforementioned items through the initial setting unit 200. Conversely, the initial setting unit 200 allows the user to fix a part of the setting items to prevent it from being changed.
Those items set or fixed by the initial setting unit 200 are notified, as required, by the unit 200 to the associated processing unit through the memory 140 or storage 100.
(Step 402) The file read unit 202 reads files stored in the file storage unit 102. The file reading may be done when the file read unit 202 monitoring the file storage unit 102 detects a file being stored into the file storage unit 102 or when a new file is created. It is also possible to crawl the file storage unit 102 and successively read all files stored there. Or in response to an instruction from the user through the input/output interface 160, the file read unit 202 may read a set of files specified by the user.
In either case, when a fuzzy hash is calculated from the file read in according to the steps shown in
The file read unit 202 may also read blocks at lower levels, rather than electronic files, through a block IO.
The file read unit 202 temporarily stores in the memory 140 a file read in or a block read in through the block IO as a digital sequence and calls up a normalization unit 204.
The destination in which a digital sequence is to be stored may be a storage 100. In the following description, the word “memory 140”, whenever it appears, also implies the storage 100.
The call-up operation may involve starting a processing unit in the called-up device (when the device of interest is already running, no action is taken) to notify the processing unit of the destination device in which the digital sequence saved in the memory 140 is to be stored. It may also be possible to send the digital sequence per se to the processing unit in the called-up device. In the following, the call-up operation implies what is mentioned above.
(Step 404) The normalization unit 204 removes from the digital sequence on the memory 140 information not necessary for calculation of fuzzy hash. More specifically, it extracts only text information from the digital sequence and performs shaping operations on the text, such as removing blanks and eliminating irregularities or unevenness among characters or words for more unified form or consistency. For details of such normalization operations, see a pamphlet of international publication No. WO2006/122086.
The normalization unit 204 and the step 404 are not essential in this embodiment. That is, the method and apparatus of this embodiment allow a fuzzy hash to be calculated directly from a digital sequence without having to extract text information from the digital sequence and shape it.
The normalization unit 204 temporarily stores the normalized data in the memory 140 and calls up the data dividing unit 206. If the fuzzy hash calculation unit 122 does not include the normalization unit 204, it temporarily stores in the memory 140 the data that the file read unit 202 has read in before calling up the data dividing unit 206. In the description that follows, data on the memory 140 that are to be read by the data dividing unit 206 are referred to as “normalized data”.
(Step 406) To divide the normalized data on the memory 140, the data dividing unit 206 sets the level t to t_max and temporarily stores this value in the memory 140. Here, t_max is, as explained with reference to
(Step 408) The data dividing unit 206 determines dividing points on the normalized data in the memory 140 for the level t. That is, for each point of the normalized data, K-byte data with its head located at that point is put into the hash function 32. Any point at which all the t endmost bits of the resultant hash value 34 are zeros is taken as a dividing point. Here, K is the number of bytes required to produce the hash value 34 explained in
(Step 410) The data dividing unit 206 calculates the number of partitions from the set of dividing points determined by step 408 and checks whether the total number of partitions, summed over the levels processed so far, exceeds the output partition number. If the output partition number is not exceeded, the processing moves to step 412, where it lowers the level t by one, and then repeats the operation from step 408 onward. If the total is in excess of the output partition number, the processing ends the dividing point determination operation and proceeds to step 414.
(Step 414) After dividing points have been determined by the processing of step 408 to step 412, the data dividing unit 206 divides the normalized data based on a set of the dividing points and temporarily stores a set of the divided pieces of data in the memory 140, after which it calls up the partition hash calculation unit 208.
(Step 416) The partition hash calculation unit 208 computes a partition hash for each of the divided data pieces on the memory 140. The calculation of the partition hashes may be done by, for example, a commonly used hash function mentioned in the following literature.
R. Rivest: “The MD5 Message-Digest Algorithm”, RFC 1321, April 1992.
The partition hash calculation unit 208 temporarily stores in the memory 140 a set of partition hashes calculated for each of the associated divided pieces of data and then calls up the fuzzy hash output unit 210.
(Step 418) The fuzzy hash output unit 210 determines a fuzzy hash from the set of partition hashes on the memory 140. At a stage of executing step 414, there is a possibility that the total number of partitions may be larger than the output partition number set by the initial setting unit 200. So, if the fuzzy hash is output as is, its length may be greater than is desired. In that case, the fuzzy hash output unit 210 adjusts the output size of the fuzzy hash either by omitting only excess partition hashes or discarding all partition hashes in a lowermost level set.
When this kind of omission is adopted, the omission processing may be done by the data dividing unit 206 at step 414. This offers an advantageous effect of reducing the amount of calculation performed by the partition hash calculation unit 208.
Although the fuzzy hash may increase in length, its length will not increase significantly. So, the fuzzy hash output unit 210 may be configured to output the excess partition hashes, rather than discarding them.
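The three ways of handling excess partition hashes described above can be sketched as follows. The function name, the mode names "trim", "drop_level" and "keep", and the representation of the partition hashes as a mapping from level to a list are hypothetical and serve only as an illustration.

```python
def adjust_output_size(levels: dict, output_partition_number: int, mode: str = "trim") -> dict:
    """Adjust the number of partition hashes that make up a fuzzy hash.

    levels: mapping level -> list of partition hashes at that level.
    mode:   "trim"       omit only the excess hashes from the lowermost level,
            "drop_level" discard the whole lowermost level,
            "keep"       output everything, excess included.
    """
    ordered = sorted(levels.items(), reverse=True)  # highest level first
    total = sum(len(hashes) for _, hashes in ordered)
    excess = total - output_partition_number
    if excess <= 0 or mode == "keep":
        return dict(ordered)
    lowest_level, lowest_hashes = ordered[-1]
    kept = dict(ordered[:-1])
    if mode == "trim":
        kept[lowest_level] = lowest_hashes[:max(0, len(lowest_hashes) - excess)]
    elif mode != "drop_level":
        raise ValueError(f"unknown mode: {mode}")
    return kept


example = {3: ["a"], 2: ["a", "b"], 1: ["a", "b", "c", "d"]}
print(adjust_output_size(example, 5, "trim"))
print(adjust_output_size(example, 5, "drop_level"))
```

With seven partition hashes and an output partition number of five, "trim" removes two hashes from level 1, whereas "drop_level" removes level 1 altogether.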
As a more effective output size adjusting method there has been known a method using the “Bloom filter”. The Bloom filter is a probabilistic data structure with good spatial efficiency and used to find out whether an element is a member of a particular set. Although it has a drawback that as the number of elements added to a set increases, the possibility of falsely determining elements not included in the set as belonging to that set increases, the Bloom filter can reduce the size of the set. In the following, the method of adjusting an output size based on the Bloom filter will be described in detail.
The Bloom filter is a bit sequence. Suppose its length is N. At step 418 the fuzzy hash output unit 210 groups the partition hashes obtained at step 416 by level and generates one or more Bloom filters for each level according to the method described below. After generating Bloom filters for all levels, the fuzzy hash output unit 210 links them together to produce a fuzzy hash before outputting it.
The Bloom filter is generated as follows. First, a bit sequence (Bloom filter) of a length N is prepared and all bits of the sequence are set to zeros. Further, k hash functions are prepared each of which, when data of an arbitrary length is entered, produces a value in a range from 0 to N−1. These hash functions produce k different hash values from the same data and have a different purpose from those of the hash function 32 (rolling hash) and partition hashes explained in
Next, the fuzzy hash output unit 210 selects one of the levels and, from among the set of partition hashes calculated by step 416, chooses one partition hash belonging to the selected level. Then, the fuzzy hash output unit 210 applies the k Bloom hash functions to the chosen partition hash to produce k output values (A_1, A_2, . . . , A_k). The fuzzy hash output unit 210 changes to 1 the values of Bloom filter bits at those positions corresponding to the k output values obtained (those bits in the sequence whose addresses are represented by A_1, A_2, . . . , A_k). In the following, this operation to change bit values of the Bloom filter based on the partition hash is referred to as a “registration of partition hash”.
Next, the fuzzy hash output unit 210 selects from among the set of partition hashes calculated by step 416 another partition hash belonging to the selected level and performs the partition hash registration on it. Here, there is a possibility that, of those k bits in the Bloom filter that this round of partition hash registration is going to change, some may have already been changed to 1. In that case, their values are left unchanged at 1.
In the following steps, the fuzzy hash output unit 210 applies the partition hash registration to all the remaining partition hashes belonging to the selected level in the set of the partition hashes calculated by step 416. As a result, a Bloom filter is produced which has a part of its bit sequence changed to 1. Described above is the method of generating a Bloom filter corresponding to the selected level.
With the aforementioned Bloom filter generating method, the memory size required to represent a set of partition hashes belonging to one level can be made N bits.
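The registration of partition hashes for one level can be sketched as follows. Deriving the k Bloom hash functions from SHA-256 with a one-byte index prefix is an assumption made for illustration, as are the function names; the filter is held as one byte per bit for readability rather than compactness.

```python
import hashlib


def bloom_positions(partition_hash: bytes, k: int, n: int) -> list[int]:
    """Return k positions in the range 0..n-1 for one partition hash."""
    positions = []
    for i in range(k):
        # Assumed construction: the i-th function is SHA-256 with prefix byte i.
        digest = hashlib.sha256(bytes([i]) + partition_hash).digest()
        positions.append(int.from_bytes(digest[:8], "big") % n)
    return positions


def register(bloom: bytearray, partition_hash: bytes, k: int) -> None:
    """Registration of partition hash: set the k addressed bits to 1."""
    for pos in bloom_positions(partition_hash, k, len(bloom)):
        bloom[pos] = 1  # a bit that is already 1 simply stays 1


def build_bloom_filter(partition_hashes, n: int, k: int) -> bytearray:
    """Generate the Bloom filter for one level."""
    bloom = bytearray(n)  # all bits start at zero
    for partition_hash in partition_hashes:
        register(bloom, partition_hash, k)
    return bloom


level_filter = build_bloom_filter([b"partition-1", b"partition-2"], n=64, k=3)
print(sum(level_filter), "of", len(level_filter), "bits set")
```

However many partition hashes are registered, the filter for the level occupies N positions, here 64.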
By evaluating commonality of Bloom filters generated from different sets of partition hashes (as by counting the number of bits whose values match), it is possible to estimate how much the registered sets of partition hashes have in common. This is because the same partition hashes, when registered, will result in the bit values at the same positions in Bloom filters becoming 1. However, there is a possibility that, even when different partition hashes are registered, the bit values at the same positions in Bloom filters may also become 1. Generally the possibility of a false assessment will increase with the number of partition hashes registered in one Bloom filter. This possibility of false assessment may be reduced as by making the size of Bloom filter N large, or using a plurality of Bloom filters for one level (i.e., creating a new Bloom filter for registration when the number of registered partition hashes exceeds an upper limit).
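Counting matching bits, the example of commonality evaluation mentioned above, can be sketched as follows. The function names and the normalization of the count by the filter length are assumptions made for illustration.

```python
def matching_bits(bloom_a, bloom_b) -> int:
    """Count the positions at which two Bloom filters hold the same bit value."""
    if len(bloom_a) != len(bloom_b):
        raise ValueError("Bloom filters must have the same length N")
    return sum(1 for x, y in zip(bloom_a, bloom_b) if x == y)


def bloom_similarity(bloom_a, bloom_b) -> float:
    """Fraction of matching bits (an assumed normalization)."""
    return matching_bits(bloom_a, bloom_b) / len(bloom_a)


# Hypothetical 8-bit filters for the same level of two fuzzy hashes.
filter_a = [1, 0, 1, 0, 0, 0, 1, 0]
filter_b = [1, 0, 0, 0, 0, 1, 1, 0]
print(matching_bits(filter_a, filter_b))
print(bloom_similarity(filter_a, filter_b))
```

Because different partition hashes can set the same bit, such a count is only an estimate of how much the registered sets have in common.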
A fuzzy hash, the final output, can be made smaller in size by reducing the value of N. As described above, however, there is a tradeoff between the accuracy in finding similarity between Bloom filters and the compactness in size of Bloom filters. So, in using Bloom filters, the value of N needs to be determined beforehand at step 400, taking the required precision and the calculation resources into account.
For the Bloom filter described above, see the following literature.
B. Bloom: “Space/Time Tradeoffs in Hash Coding with Allowable Errors”, Communications of the ACM 13:7, pp. 422-426, 1970.
The fuzzy hash output unit 210 outputs to the fuzzy hash storage unit 104 and/or the input/output interface 160 the fuzzy hash that has been obtained either by discarding an excess, outputting the fuzzy hash without discarding the excess or using a Bloom filter. It is noted, however, that because the comparison between fuzzy hashes requires finding common partition hashes for each level, the fuzzy hashes are output in a manner that makes clear which level the partition hashes belong to. It is also possible to allow the user to choose, through the initial setting unit 200, a desired method—either discarding an excess, outputting a fuzzy hash without discarding the excess or using Bloom filters.
With the above steps taken, the fuzzy hash calculation process is complete.
While at step 406 the level t has been set to the highest of the levels that the hash function 32 can determine, t_max, it is also possible to set the level t to lower than t_max and start dividing the normalized data from that level. The starting level is set by the initial setting unit 200 at step 400.
Conversely, the level t may be set greater than t_max. At such a level the normalized data is not divided, so the partition hash belonging to the highest level of the fuzzy hash is always the partition hash of the normalized data as a whole. In this case, the data dividing unit 206 at step 408 may skip the dividing point calculation operation and proceed immediately to step 410, taking the partition number at level t as 1 (i.e., there is no dividing point). This setting, too, is made by the initial setting unit 200 at step 400.
In the following description, the level t_max implies not only the highest level determined by the hash function 32 but also levels that are lower or higher than the highest level set by the initial setting unit 200.
Next, referring to
(Step 500) The data dividing unit 206 scans the normalized data on the memory 140 from the normalized data starting point one byte at a time to calculate dividing points on the normalized data. It sets the scan position p at 0 and temporarily saves this value in the memory 140.
(Step 502) The data dividing unit 206 reads the normalized data from pth piece of data up to (p+K−1)th. Here K represents the number of bytes required to determine the hash value 34 explained in
The data dividing unit 206 feeds the K bytes of data read in into the hash function 32 of
The hash function 32 may be a rolling hash described in Patent Literature 1 and Non Patent Literature 1 or any other kind of function. The user may set a desired function through the initial setting unit 200.
(Step 506) The data dividing unit 206 checks t endmost bits of the hash value 34 of interest to see if all of these bits are zeros. If all of them are zeros, the data dividing unit 206 takes p as a dividing point candidate and temporarily saves the value of p before moving to step 508. If not, the unit 206 jumps to step 512.
The condition for determining the dividing point does not need to be limited to the one in which the t endmost bits are all 0's. In essence, the only requirement is whether t bits extracted according to a predetermined rule match a preset bit sequence. For example, if a rule is adopted that a point under consideration is taken as a dividing point only when t most significant bits or foremost bits are 0101 . . . , a decision on whether the point of interest is a dividing point need only be made according to that rule. Such a rule is set by the initial setting unit 200.
(Step 508) The data dividing unit 206 compares the dividing point candidate p determined at step 506 with the point p0 that was last stored in the memory 140 at step 510, and calculates the interval p−p0. If p0 does not exist, the head of the normalized data on the memory 140 is used instead (p0=0).
If this interval is greater than a minimum partition interval determined beforehand by the initial setting unit 200, p is taken as a dividing point and the processing moves to step 510. If not, the data dividing unit 206 decides that p cannot be regarded as a dividing point, and jumps to step 512.
(Step 510) The data dividing unit 206 adds p to a set of dividing points and temporarily stores the dividing point set in the memory 140.
(Step 512) If p+K−1 is located at the tail end of the normalized data, the data dividing unit 206 decides that the normalized data has all been scanned and exits the processing. If not, the data dividing unit 206 moves to step 514 where it increments p by 1, before repeating the process from step 502 onward.
If at step 508 the data dividing unit 206 determines p to be a dividing point, the position where the next dividing point will occur is beyond the minimum partition interval d added to p. So, step 514 may increment p by d, instead of 1. In that case, step 512 checks whether p+K+d−1, not p+K−1, is located at the end of the normalized data.
With the above operations done, the processing of step 408 is complete.
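Steps 500 to 514 can be sketched as follows. This is an illustrative sketch only: `zlib.crc32` stands in for the hash function 32 (the embodiment may use a rolling hash or any other function), and the function name and the parameter names `k` and `min_interval` are assumptions.

```python
import zlib


def dividing_points(data: bytes, t: int, k: int = 4, min_interval: int = 2):
    """Return the dividing points of `data` at level `t`.

    A position p is a candidate when the t endmost (least significant)
    bits of the hash of data[p:p+k] are all zero. It is accepted only
    when its interval from the previous dividing point exceeds
    min_interval.
    """
    mask = (1 << t) - 1
    points = []
    p0 = 0                      # head of the data when no point is stored yet
    p = 0                       # step 500: scan position
    while p + k <= len(data):   # step 512: stop once the window reaches the tail
        h = zlib.crc32(data[p:p + k])               # steps 502-504 (stand-in hash)
        if h & mask == 0 and p - p0 > min_interval:  # steps 506 and 508
            points.append(p)                         # step 510
            p0 = p
        p += 1                                       # step 514
    return points
```

Raising `t` by one halves the chance that a given position is a candidate, which is why higher levels yield fewer partitions.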
Now the detailed method of outputting and storing the fuzzy hash thus obtained will be explained by referring to
As explained in
The fuzzy hash output unit 210 of
The fuzzy hash storage unit 104 of
Further, the fuzzy hash management table 62 may hold information on the locations and length in a digital sequence of individual divided pieces of data corresponding to the partition hashes making up each of the fuzzy hashes. The use of this information, as explained later with reference to
In a general file system, the attributes of a file are managed by the folder containing that file. The fuzzy hash storage unit 104 may manage the fuzzy hash management table 62 in each folder together with the file attributes. If an expansion area 640 to which external data can be added exists on the file system holding a file 64 whose fuzzy hash has been calculated, the fuzzy hash can be written into the expansion area 640. These methods obviate the need for the fuzzy hash management table 62.
In this embodiment, for quick search for similar files distances between fuzzy hashes are calculated beforehand by the distance calculation unit 124 and stored in the distance storage unit 106. To achieve this objective, the control unit 110, when a fuzzy hash for a file is calculated and stored in the fuzzy hash storage unit 104, sends that fuzzy hash and other fuzzy hashes already stored in the fuzzy hash storage unit 104 to the distance calculation unit 124.
First, the configuration of the distance calculation unit 124 will be explained by referring to
The distance calculation unit 124 has a fuzzy hash reading unit 702 to read two fuzzy hashes from the fuzzy hash storage unit 104; a partition hash matching unit 704 to identify a common partition hash from the fuzzy hashes read in; a comparison excluding unit 706 to determine if a partition hash of interest is to be excluded from the comparison operation; and a distance output unit 708 to calculate and output a distance between the fuzzy hashes based on the portions of partition hashes that have been determined to match. The distance calculation unit 124 also includes an initial setting unit 700 that sets parameters for the processing units 702, 704, 706 and 708.
In the calculation of a distance between fuzzy hashes, the processing units 700, 702, 704, 706, 708 store data temporarily in the memory 140 for processing, such as checking, editing and deletion.
The distance between the fuzzy hashes determined by the distance calculation unit 124 is stored in the distance storage unit 106 by the distance output unit 708. Alternatively, it may be presented to the user, for example, on a display through the input/output interface 160.
Before proceeding to describe the detailed operations of individual processing units shown in
The partition hash matching unit 704 first compares partition hashes at the highest level t_max. In the example of
Next at level t_max−1, the partition hash matching unit 704 compares each of partition hashes H(1, 1), H(1, 2), H(2, 1), H(2, 2) and H(2, 3) with G(1, 1), G(1, 2), G(1, 3), G(2, 1), G(2, 2) and G(2, 3) to see if there is any match. In the example of
With these sets of partitions removed, the partition hash matching unit 704 at the next level t_max−2 performs the comparison operation on those partition hashes not belonging to the partition hash sets 800 and 820.
Because matching partition hashes, if found at a high level, are removed from those partition hashes to be compared at lower levels as described above, the distance calculation can be made faster.
As a final step, the distance output unit 708 calculates a distance based on the total number of partition hashes at the lowest level t_max−2 and the number of partition hashes found to match by the above comparison. Here the distance is defined as the number of partition hashes that fail to match. In the example of
The method for finding common partition hashes for each level generally entails a large amount of computations. When, for example, a file is edited to change the order of sentences, there is a possibility that the order of partition hashes may also change. Therefore, to extract matching portions correctly requires partition hashes to be compared one by one.
As a distance calculation method that efficiently finds common portions by considering the possibility of partition hashes changing in their order, there is known a method that uses an edit graph, as described in a literature cited below. The edit graph method is an approach originally proposed to match character sequences against each other. If partition hashes are regarded as characters, the edit graph method can be applied to calculating the distance.
E. W. Myers: “An O(ND) difference algorithm and its variations”, Algorithmica, 1, pp. 251-266, 1986.
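The idea of treating partition hashes as characters can be illustrated with the sketch below. Note that it does not implement the O(ND) edit graph algorithm of the cited literature; it uses a plain dynamic-programming longest common subsequence, which finds the same number of order-preserving matches at a higher computational cost. The function names are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def unmatched_count(a, b):
    """Number of partition hashes in `a` that fail to match any in `b`."""
    return len(a) - lcs_length(a, b)
```

For example, `unmatched_count(["h1", "h2", "h3"], ["h1", "hX", "h3"])` counts the single partition hash `"h2"` as failing to match.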
This method will be described in detail by referring to the fuzzy hash management table 62 of
When the fuzzy hash output unit 210 outputs a fuzzy hash using the aforementioned Bloom filter, it matches the Bloom filters against each other to calculate their distance. More precisely, commonality between the Bloom filters, that are generated from different sets of partition hashes, is evaluated (as by counting the number of bits whose values match), making it possible to determine how much commonality there is between the registered sets of partition hashes. For more detail, see the literature cited below.
Brin S., Davis J., Garcia-Molina H.: “Copy detection mechanisms for digital documents”, Proceedings of the ACM SIGMOD annual conference, San Jose, Calif., May 1995.
Even with the use of the edit graph or the Bloom filter in calculating a distance, the calculation volume increases with the number of partition hashes. To deal with this problem, this embodiment presents a method that focuses on the fact that the higher the level, the lower the probability of occurrence of a dividing point. The method starts finding common portions at the highest level, where there are the fewest partitions, and moves one level down at a time. This minimizes the number of partition hashes at lower levels that have to undergo the comparison operation, thereby reducing the calculation volume. This is a second aspect of this embodiment.
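The top-down comparison with exclusion can be sketched as follows, under several assumptions that are not stated in this description: each fuzzy hash is modeled as a mapping from a level to a list of `(start, end, hash)` partitions, location information is available for every partition, and matching is done by simple set membership rather than by the edit graph method. All names are hypothetical.

```python
def covered(span, matched_spans):
    """True if `span` lies inside a span already matched at a higher level."""
    s, e = span
    return any(ms <= s and e <= me for ms, me in matched_spans)


def top_down_distance(fh_a, fh_b, t_max, t_min):
    """Count level-t_min partitions of fh_a with no counterpart in fh_b.

    fh_a and fh_b map level -> list of (start, end, hash).
    Partitions inside a span matched at a higher level are skipped.
    """
    done_a, done_b = [], []   # spans already found to be common
    unmatched = 0
    for t in range(t_max, t_min - 1, -1):
        todo_a = [p for p in fh_a[t] if not covered(p[:2], done_a)]
        todo_b = [p for p in fh_b[t] if not covered(p[:2], done_b)]
        hashes_a = {h for _, _, h in todo_a}
        hashes_b = {h for _, _, h in todo_b}
        done_a += [(s, e) for s, e, h in todo_a if h in hashes_b]
        done_b += [(s, e) for s, e, h in todo_b if h in hashes_a]
        if t == t_min:
            unmatched = sum(1 for _, _, h in todo_a if h not in hashes_b)
    return unmatched
```

In this sketch a match found at a high level removes all the partitions beneath it from the later comparisons, so the lists `todo_a` and `todo_b` shrink as the level goes down.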
When the distance calculation is carried out as described above, since the numbers of partition hashes at the lowest level do not always agree, a fuzzy hash distance table 1100 stored in the distance storage unit 106, which will be explained referring to
To make the distance table a symmetric matrix, a method is conceivable which calculates the distance using the total number of differing partition hashes ranging from the highest level t_max to the lowest level, rather than counting them at only the lowest level. This is because the output partition numbers of two fuzzy hashes are equal and the numbers of differing partition hashes that are obtained by subtracting the number of partition hashes deemed common at all levels from the output partition number are also equal.
Further, although the above method calculates the distance based on the number of partition hashes, if the fuzzy hash management table 62 explained by referring to
The method of calculating the distance between fuzzy hashes will be explained further by referring to
With the conventional method described in Patent Literature 1 and Non Patent Literature 1, only a set of partition hashes at the lowest level is output as a fuzzy hash. So, if the length of a file should change as by editing, as shown in
On the other hand, this embodiment compares the fuzzy hashes at the same level and regards the minimum value of common level as a lowest level for use in the calculation of the distance (hereinafter referred to as a “common lowest level”). In the example of
A flow of the fuzzy hash distance calculation method will be explained by referring to
(Step 1000) The initial setting unit 700 executes settings for the processing units 702, 704, 706, 708, for example, allowing the distance calculation method, such as edit graph or Bloom filter, implemented by the distance output unit 708 to be selected.
The user can set the above items through the initial setting unit 700. Conversely, it is also possible to fix parts of the setting items so that they cannot be set by the initial setting unit 700. In the following description it is assumed that the items either set or fixed by the initial setting unit 700 are notified, as required, to the associated processing units by the initial setting unit 700 through the memory 140 or storage 100.
(Step 1002) The fuzzy hash reading unit 702 reads two fuzzy hashes from the fuzzy hash storage unit 104 and temporarily stores them in the memory 140. The fuzzy hashes to be read in are specified by the control unit 110 which also starts the fuzzy hash reading unit 702, when necessary. In addition, the control unit 110, when prompted by the user through the input/output interface 160, may read in fuzzy hashes specified by the user.
When the fuzzy hash reading unit 702 has read in two fuzzy hashes and saved them in the memory 140, the processing moves to step 1004.
The storage 100 may be used as the destination in which to temporarily store the fuzzy hashes. In the following description, the word “memory 140”, whenever it appears, also implies the storage 100.
(Step 1004) The fuzzy hash reading unit 702 calculates how many levels there are to each of the two fuzzy hashes on the memory 140 and determines a common lowest level t_min, the lowest of the levels common to the two sets of levels (i.e., the lowest level in the intersection of the two level sets). The fuzzy hash reading unit 702 temporarily stores the common lowest level t_min on the memory 140 before calling the partition hash matching unit 704.
The call-up operation may involve starting the target processing unit (if the processing unit of interest is already running, nothing is done) and notifying the processing unit of the destination in which the data temporarily saved in the memory 140 is to be stored, or picking up the data itself and sending it to the target processing unit. In the descriptions that follow, the call-up operation implies the operation described above.
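The determination of the common lowest level t_min at step 1004 amounts to taking the minimum of the levels shared by the two fuzzy hashes. A minimal sketch, in which the function name and the representation of each fuzzy hash as a collection of integer levels are assumptions:

```python
def common_lowest_level(levels_a, levels_b):
    """Return t_min, the lowest level present in both fuzzy hashes."""
    common = set(levels_a) & set(levels_b)
    if not common:
        # Behavior for disjoint level sets is an assumption of this sketch.
        raise ValueError("the two fuzzy hashes share no level")
    return min(common)
```

For instance, fuzzy hashes holding levels 5, 4, 3 and levels 6, 5, 4 share levels 5 and 4, so the sketch returns 4.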
(Step 1006) To identify common partition hashes from the fuzzy hashes, the partition hash matching unit 704 sets the level t to t_max and temporarily stores this value in the memory 140.
(Step 1008) The partition hash matching unit 704 identifies matching partition hashes in a partition hash set in level t as by the edit graph method explained in
The partition hash matching unit 704 temporarily stores in the memory 140 information about which partition hashes are identical, before calling up the comparison excluding unit 706.
(Step 1010) For levels lower than a level where some partition hashes are newly determined by step 1008 to be identical between the two fuzzy hashes, the comparison excluding unit 706 records a set of those partition hashes at the lower levels that corresponds to the identical partition hashes as being excluded from comparison. This record is temporarily stored in the memory 140.
(Step 1012) The comparison excluding unit 706 checks whether the current level t is greater than the common lowest level t_min stored in the memory 140. If so, the processing moves to step 1014 where it decrements t by one and repeats the operation from step 1008 onward. If not, the comparison excluding unit 706 calls the distance output unit 708 before jumping to step 1016.
(Step 1016) The distance output unit 708 calculates the distance from the number of common partition hashes on the memory by the method explained in
In the aforementioned flow, the process of finding common partition hashes has been described to start from the highest level t_max where there is the least number of partitions and move one level down at a time to minimize the number of partition hashes at lower levels that need to be compared, thereby reducing the calculation volume. However, if the algorithm, such as edit graph and Bloom filter, to identify common portions is able to run at high speed because of sufficient computation capability of CPU 120, the initial value t can be set to less than t_max. At this time the initial setting unit 700 at step 1000 sets the initial value t0_max, and at step 1006 t is replaced with
t=max(t0_max, t_min)
If the common lowest level t_min is greater than t0_max, common hashes are searched directly at level t_min.
Further, it is possible to adopt a method that matches the fuzzy hashes against each other at only the common lowest level. In that case, there is no need for the initial setting unit 700 to hold the threshold. Nor does the distance calculation unit 124 need to have the comparison excluding unit 706.
In the above, we have explained the method of calculating fuzzy hashes and the method and apparatus for calculating distances between fuzzy hashes by referring to
The distance storage unit 106 has a distance table 1100 for managing the distances between fuzzy hashes and a fuzzy hash management table 1120 for managing the relations between fuzzy hashes and files. In the example of
Provision of the distance table 1100 and the fuzzy hash management table 1120 makes it possible to quickly find a fuzzy hash close to a given unknown fuzzy hash. The high speed search is performed as follows. When an unknown fuzzy hash is given, some fuzzy hashes are picked up from the fuzzy hash management table 1120 and their distances from the given fuzzy hash are calculated. Next, the distance table 1100 is searched to find a distance value distribution similar to a distribution of the calculated distance values. Fuzzy hashes associated with the distance value distribution thus found can be identified from the distance table 1100 and then strictly examined to determine how close they are to the unknown fuzzy hash. Since this method performs comparison not for all fuzzy hashes but for only some representatives, a fuzzy hash closest to the unknown fuzzy hash can be found quickly. For more detail, see the following literature.
Edgar Chavez, Gonzalo Navarro, Ricardo Baeza-Yates and Jose L. Marroquin: “Searching in metric spaces”, ACM Computing Surveys 33, 3, pp. 273-321, 2001.
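One common form of such a search is pivot-based filtering, sketched below. This is an illustration of the general technique surveyed in the cited literature, not the search procedure of this embodiment. It assumes the distance is a true metric (symmetric and satisfying the triangle inequality); the toy distance `set_distance`, the function names and the choice of pivots are all hypothetical.

```python
def set_distance(a, b):
    """Toy metric: size of the symmetric difference of two sets."""
    return len(a ^ b)


def build_table(items, pivots, dist):
    """Precompute the distance from every stored item to every pivot."""
    return {name: [dist(items[name], items[p]) for p in pivots]
            for name in items}


def nearest(query, items, pivots, table, dist):
    """Find the stored item closest to `query`.

    Returns (name, distance, number of full distance evaluations).
    |d(q, p) - d(p, c)| is a lower bound on d(q, c) for any pivot p, so
    candidates whose bound cannot beat the best distance are skipped.
    """
    q_to_pivots = [dist(query, items[p]) for p in pivots]
    bounds = sorted(
        (max(abs(q - c) for q, c in zip(q_to_pivots, row)), name)
        for name, row in table.items()
    )
    best_name, best_d, evaluated = None, float("inf"), 0
    for lower_bound, name in bounds:
        if lower_bound >= best_d:
            break  # no remaining candidate can be closer
        d = dist(query, items[name])
        evaluated += 1
        if d < best_d:
            best_name, best_d = name, d
    return best_name, best_d, evaluated
```

Only the distances from the query to the pivots and to the few surviving candidates are computed, which is the source of the speedup over comparing the query with every stored fuzzy hash.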
It has generally been known that, to realize a high-speed search, the distance table 1100 is preferably a symmetric matrix. As explained earlier with reference to
In the example of
Peter N. Yianilos: “Data structures and algorithms for nearest neighbor search in general metric spaces”, ACM-SIAM Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms), pp. 311-321, 1993.
The fuzzy hash management table 1120 is similar to the table 62 explained in
The file search unit 126 outputs a set of files similar to a file 1210 as a search result 1212. The file search unit 126 has a file read unit 1200 to read the file 1210 through the input/output interface 160 to calculate a fuzzy hash in cooperation with the fuzzy hash calculation unit 122; a distance index unit 1202 to determine a fuzzy hash near the calculated fuzzy hash by using information stored in the distance storage unit 106; and a similar file output unit 1204 to output information on a file corresponding to the nearest fuzzy hash as a search result 1212 through the input/output interface 160.
For details of the search algorithm of the distance index unit 1202, see the literature cited above. In this embodiment, a detailed explanation of the algorithm is omitted.
The file search unit 126 outputs the file similar to the file 1210 as the search result 1212. The number of similar files to be output as the search result 1212 can be set by the initial setting unit, not shown, in the file search unit 126. Further, if the fuzzy hash management table 1120 in the distance storage unit has information on the locations of divided pieces of data corresponding to partition hashes that make up a fuzzy hash, it is also possible to present which part of the similar file matches the file 1210 as the search result 1212.
Further, similar files can be searched without preparing the distance table 1100 in advance. The configuration of the file search unit that may be used in that case is shown at 126-2 in
The file search unit 126-2 has the file read unit 1200 and the similar file output unit 1204, and also includes a distance calculation unit 124-2, in place of the distance index unit 1202, that determines a nearest fuzzy hash by using information stored in the fuzzy hash storage unit 104.
The distance calculation unit 124-2 has a similar configuration to the distance calculation unit 124 shown in
With this method using the distance calculation unit 124-2, the digital sequence feature amount calculation apparatus 10 does not need to have the distance storage unit 106. Although its search speed is lower than that of the file search unit 126, this method has the advantage of reducing the required capacity of the storage 100 because the distance storage unit 106 is not needed.
With the method and devices shown in
Fuzzy hashes have two characteristics: (1) they allow a similarity check among different files and (2) their size is small and fixed. To meet the characteristic (2), the conventional techniques of Patent Literature 1 and Non Patent Literature 1 adjust the level to keep the output size constant. This adjustment, however, often results in a distance between two fuzzy hashes failing to be correctly calculated when the length of a file has changed. To deal with this problem, Embodiment 1 has proposed a method which sets an output partition number beforehand and outputs, within a range not exceeding the output partition number, all partition hashes produced through division at various levels.
Either of these methods introduces some means to satisfy the requirement (2). It is noted, however, that the requirement (2) is itself a restraint intended to keep fuzzy hashes from imposing an onerous burden on the storage capacity, and that there may be cases where it can be eliminated, as when the storage capacity is sufficiently larger than the files under consideration. In that case, the size of the feature quantity can be increased in proportion to the file size. Because of the increased volume of information, the similarity check accuracy can then be expected to improve over the conventional techniques and Embodiment 1, both of which throw away some parts of information to make the output size conform to the fixed length under the restraint of (2).
Thus, Embodiment 2 provides a method of calculating a feature quantity of a digital sequence that excludes the requirement (2), and a similar file search method. This embodiment also offers an apparatus for implementing these methods.
In the description that follows, a feature quantity with the requirement (2) excluded is called a “variable fuzzy hash”. It is “variable” because this feature quantity which is no longer restrained by the requirement (2) can be expanded in size according to the length of a file.
In the following, it will be made clear, by applying
For detailed explanation of this dividing method, an example flow chart of a variable fuzzy hash calculation method will be described by applying
(Step 400) This step is almost the same as step 400 of Embodiment 1. It is noted, however, that the initial setting unit 200, rather than setting the output partition number, sets an “output level” as a fixed value used for generating variable fuzzy hashes and for calculating distances using the variable fuzzy hashes. In Embodiment 1 the output partition number has been set to fix the output size, whereas in this embodiment the output level is introduced in place of the output partition number.
(Step 402) This step is the same as step 402 of Embodiment 1.
(Step 404) This step is the same as step 404 of Embodiment 1.
(Step 406) The data dividing unit 206, to divide normalized data on the memory 140, sets a level t at an output level t0 and temporarily saves this value in the memory 140. In finding dividing points, Embodiment 1 starts from the highest level t_max, moving one step down at a time. This embodiment determines the dividing points only at the output level t0.
(Step 408) This step is the same as step 408 of Embodiment 1.
(Step 410) There is no output partition number in this embodiment, so the processing moves directly to step 414, without comparing the partition number.
(Step 414) This step is the same as step 414 of Embodiment 1.
(Step 416) This step is the same as step 416 of Embodiment 1.
(Step 418) The fuzzy hash output unit 210 outputs a set of partition hashes from the memory 140 as is, as the variable fuzzy hashes.
It is noted that in this embodiment, too, the output size can be adjusted by using Bloom filters.
With the above steps taken, the variable fuzzy hash calculation process is complete.
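The single-level division of this embodiment can be sketched as follows. The sketch makes several assumptions for illustration: `zlib.crc32` stands in for the hash function 32, a truncated SHA-256 digest stands in for the ordinary hash applied to each divided piece, the minimum partition interval is omitted for brevity, and the function name is hypothetical.

```python
import hashlib
import zlib


def variable_fuzzy_hash(data: bytes, t0: int, k: int = 4):
    """Divide `data` at the single output level t0 and hash every piece.

    The number of partition hashes returned is not capped, so the
    output grows with the length of the data.
    """
    mask = (1 << t0) - 1
    cuts = [p for p in range(1, len(data) - k + 1)
            if zlib.crc32(data[p:p + k]) & mask == 0]
    bounds = [0] + cuts + [len(data)]
    return [hashlib.sha256(data[s:e]).hexdigest()[:8]
            for s, e in zip(bounds, bounds[1:])]
```

Because every file is divided at the same level t0, a change near the end of a file leaves the leading partition hashes untouched, and two variable fuzzy hashes are always compared at the same level.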
For detailed explanation of a method of calculating a distance between variable fuzzy hashes, an example flow chart will be described by applying
(Step 1000) This step is almost the same as step 1000 of Embodiment 1, except that the initial setting unit 700 does not make any setting on the comparison excluding unit 706.
(Step 1002) This step is the same as step 1002 of Embodiment 1.
(Step 1004) The fuzzy hash reading unit 702 does nothing in this step but calls up the partition hash matching unit 704 before jumping to step 1008.
(Step 1006) This step does not exist in this embodiment.
(Step 1008) The partition hash matching unit 704 identifies matching portions between two sets of partition hashes, each set forming a variable fuzzy hash. The partition hash matching unit 704 temporarily stores in the memory 140 information about which partition hashes are identical, before calling up the distance output unit 708. It then jumps to step 1016.
(Step 1010 to step 1014) These steps do not exist in this embodiment.
(Step 1016) This step is the same as step 1016 of Embodiment 1.
As described above, variable fuzzy hashes have only one level, so there is no possibility of their being compared at two different levels when a distance is calculated. This embodiment is therefore highly likely to calculate the distance more precisely than the conventional techniques, which may fail to judge similarity correctly when file modifications or the like expand a fuzzy hash in size and change its level. It is also likely to be more precise than Embodiment 1, which reduces the possibility of this undesired phenomenon by using sets of partition hashes at a plurality of levels. It should be noted, however, that since the variable fuzzy hash changes in length according to the file size, it may place an onerous burden on the storage capacity.
The methods and apparatus described in
Number | Date | Country | Kind |
---|---|---|---|
2010-155333 | Jul 2010 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2011/052097 | 2/2/2011 | WO | 00 | 2/26/2013 |