As computational power and performance continue to increase, more and more enterprises are storing data in databases for use in their business. Furthermore, enterprises are collecting ever increasing amounts of data. The data is stored as records, tables, tuples and other groupings of related data, hereinafter referred to collectively as tuples. The data is stored, queried, retrieved, organized, filtered, formatted and the like by ever more powerful database management systems to generate vast amounts of information. The extent of the information is limited only by the amount of data collected and stored in the database.
Unfortunately, multiple seemingly distinct tuples representing the same entity are regularly generated and stored in the database. In particular, integration of distributed, heterogeneous databases can introduce imprecision in data due to semantic and structural inconsistencies across independently developed databases. For example, spelling mistakes, inconsistent conventions, missing attribute values, and the like often cause the same entity to be represented by multiple tuples.
The duplicate tuples reduce the storage space available, may slow the processing speed of the database management system, and may result in less than optimal query results. In the conventional art, fuzzy duplicate tuples whose similarity is greater than a user-specified threshold may be identified utilizing a conventional similarity function. One method exhaustively applies the similarity function to all pairs of tuples. In another method, a specialized index (e.g., if available for the chosen similarity function) may be utilized to identify candidate tuple pairs. However, the index-based approaches result in a large number of random accesses, while the exhaustive search performs a substantial number of tuple comparisons.
The techniques described herein are directed toward probabilistic algorithms for detecting fuzzy duplicates of tuples. Candidate tuples are grouped together through a limited number of scans and sorts of the base relation utilizing locality sensitive hash vectors. A similarity function is applied to determine if the candidate tuples are fuzzy duplicates. In particular, each tuple is converted into a vector of hash values utilizing a locality sensitive hash (LSH) function. All of the hash vectors are sorted on one or more select hash coordinates, such that tuples that share the same hash value for a given vector coordinate cluster together. Tuples that cluster together for a given vector coordinate are identified as candidate tuples, such that the probability of not detecting a fuzzy duplicate is bounded. The candidate tuples are compared utilizing a similarity function. The tuple pairs that are more similar than a predetermined threshold are returned.
Embodiments are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings, in which like reference numerals refer to similar elements.
The input/output devices 125, 130 may include one or more communication ports 130 for communicatively coupling the computing device 105 to one or more other computing devices 140, 145. The one or more other devices 140, 145 may be directly coupled to one or more of the communication ports 130 of the computing device 105. In addition, the one or more other devices 140, 145 may be indirectly coupled through a network 150 to one or more of the communication ports 130 of the computing device 105. The network 150 may include an intranet, an extranet, the Internet, a wide-area network (WAN), a local area network (LAN), and/or the like.
The communication ports 130 of the computing device 105 may include any type of interface, such as a network adapter, modem, radio transceiver, or the like. The communication ports 130 may implement any connectivity strategies, such as broadband connectivity, modem connectivity, digital subscriber line (DSL) connectivity, wireless connectivity or the like. It is appreciated that the communication ports 130 and the communication channels 155-165 that couple the computing devices 105, 140, 145 provide for the transmission of computer-readable instructions, data structures, program modules, code segments, and other data encoded in one or more modulated carrier waves (e.g., communication signals) over one or more communication channels 155-165. Accordingly, the one or more communication ports 130 and/or communication channels 155-165 may also be characterized as computer-readable media.
The computing device 105 may also include additional input/output devices 125 such as one or more display devices, keyboards, and pointing devices (e.g., a “mouse”). The input/output devices 125 may further include one or more speakers, microphones, printers, joysticks, game pads, satellite dishes, scanners, card reading devices, digital cameras, video cameras or the like. The input/output devices 125 may be coupled to the bus 135 through any kind of input/output interface and bus structures, such as a parallel port, serial port, game port, universal serial bus (USB) port, video adapter or the like.
The computer-readable media 115, 120 may include system memory 120 and one or more mass storage devices 115. The mass storage devices 115 may include a variety of types of volatile and non-volatile media, each of which can be removable or non-removable. For example, the mass storage devices 115 may include a hard disk drive for reading from and writing to non-removable, non-volatile magnetic media. The one or more mass storage devices 115 may also include a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and/or an optical disk drive for reading from and/or writing to a removable, non-volatile optical disk such as a compact disk (CD), digital versatile disk (DVD), or other optical media. The mass storage devices 115 may further include other types of computer-readable media, such as magnetic cassettes or other magnetic storage devices, flash memory cards, electrically erasable programmable read-only memory (EEPROM), or the like. Generally, the mass storage devices 115 provide for non-volatile storage of computer-readable instructions, data structures, program modules, code segments, and other data for use by the computing device. For instance, the mass storage device may store an operating system 170, a database 172, a database management system (DBMS) 174, a probabilistic duplicate tuple determination module 176, and other code and data 178.
The system memory 120 may include both volatile and non-volatile media, such as random access memory (RAM) 180, and read only memory (ROM) 185. The ROM 185 typically includes a basic input/output system (BIOS) 190 that contains routines that help to transfer information between elements within the computing device 105, such as during startup. The BIOS 190 instructions, when executed by the processor 110, cause, for instance, the operating system 170 to be loaded from a mass storage device 115 into the RAM 180. The BIOS 190 then causes the processor 110 to begin executing the operating system 170′ from the RAM 180. The database management system 174 and the probabilistic duplicate tuple determination module 176 may then be loaded into the RAM 180 under control of the operating system 170′.
The probabilistic duplicate tuple determination module 176′ is configured as a client of the database management system 174′. The database management system 174′ controls the organization, storage, retrieval, security and integrity of data in the database 172. The probabilistic duplicate tuple determination module 176′ converts each tuple to a vector of hash values utilizing a locality sensitive hashing algorithm. The hash vectors are sorted, on one or more vector coordinates, to cluster similar hash values (e.g., tuples) together. Each cluster of similar hash values identifies candidate tuples. The module 176′ probabilistically detects candidate fuzzy duplicate tuples by selecting a set of vector coordinates to sort upon. The module compares the candidate fuzzy duplicate tuples utilizing a similarity function and returns pairs of tuples which are more similar than a specified threshold.
In one implementation, the number of vector coordinates to sort upon is selected as a function of a specified threshold of similarity and a specified error probability of not detecting a fuzzy duplicate. In another implementation, the probabilistic duplicate determination module 176′ selectively chooses buckets to determine which tuples to compare. The buckets are chosen as a function of the frequency of the hash coordinate values of a particular hash value. In another implementation, the module 176′ groups multiple hash coordinates together. The vectors are sorted based upon one or more of the groups of hash coordinates. In yet another implementation, the module groups multiple hash coordinates together and chooses one or more groups to sort upon based upon the collective frequency of hash coordinate values in the groups of hash coordinates.
Although for purposes of illustration, the database 172, database management system 174 and probabilistic duplicate detection module 176 are shown implemented on a single computing device 105, it is appreciated that the system may be implemented in a distributed computing environment. For example, the database 172 may be stored on a data store 140, and the probabilistic duplicate detection module 176 may be executed on a client computing device 145. The database management system 174 may be implemented on a server computing device 105 communicatively coupled between the data store 140 and the client computing device 145.
In one implementation, fuzzy duplicates may be determined utilizing a min-hash function and the Jaccard similarity function. Referring to
MinHash(R) = [ID, mh1, mh2, . . . , mhH]
is generated for each tuple. A locality sensitive hashing scheme with respect to a similarity function f is a distribution on a family H of hash functions over a collection of objects, such that for two objects x and y, Pr_{h∈H}[h(x) = h(y)] = f(x, y). One instance of the locality sensitive hashing scheme is the min-hash function. The min-hash function h maps elements of U uniformly and randomly to the set of natural numbers N, wherein U denotes the universe of strings over an alphabet Σ. The min-hash of a set S, with respect to h, is the element x in S minimizing h(x), such that mh(S) = argmin_{x∈S} h(x). A min-hash vector of S with identifier ID is a vector of H min-hashes (ID, mh1, mh2, . . . , mhH), where mhi = argmin_{x∈S} hi(x) and h1, h2, . . . , hH are H independent random hash functions.
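By way of illustration, the following is a minimal Python sketch of building such a min-hash vector. The salted use of Python's built-in hash stands in for the H independent random functions, and the names make_min_hash_family and min_hash_vector are illustrative rather than part of the described system; each coordinate stores min hi(x), which identifies argmin hi(x) whenever hi is injective on the token universe.

import random

def make_min_hash_family(num_hashes, seed=0):
    # Approximate H independent random hash functions h_i with salted hashing.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [lambda x, s=s: hash((s, x)) for s in salts]

def min_hash_vector(tuple_id, token_set, hash_family):
    # Build (ID, mh1, ..., mhH): the minimum hash value per coordinate.
    return (tuple_id,) + tuple(min(h(x) for x in token_set) for h in hash_family)

For example, min_hash_vector(2, {"john", "smith", "seattle"}, make_min_hash_family(4)) yields a vector of the form (2, mh1, mh2, mh3, mh4).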
Sorting MinHash(R) on each of the min-hash coordinates mhi clusters together tuples that are potentially close to each other. The pairs of tuples which are in the same cluster are compared using a similarity function. A cluster of tuples sharing a given hash coordinate value is referred to herein as a bucket. More specifically, a bucket B(i, c), specified by an index i and a hash value c, is the set of all min-hash vectors that have value c on mhi. The size of the bucket is the number of hash vectors (e.g., tuples) in the bucket. For example, sorting on the first coordinate mh1 yields seven buckets, with tuples 2 and 6 sharing the same hash value. Thus, sorting on the first hash coordinate mh1 generates one candidate pair (2, 6). Sorting on the second hash coordinate mh2 generates thirteen candidate pairs from one bucket containing five tuples and another bucket containing three tuples. Sorting on the third coordinate mh3 generates five candidate tuple pairs, and sorting on the fourth coordinate mh4 also generates five candidate tuple pairs.
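A minimal sketch of this bucketing step follows, assuming min-hash vectors of the form produced above; grouping with a dictionary stands in for the sort of MinHash(R) described in the text, and the function name is illustrative.

from collections import defaultdict
from itertools import combinations

def candidate_pairs_on_coordinate(min_hash_vectors, i):
    # Bucket B(i, c): all vectors whose ith min-hash coordinate equals c.
    buckets = defaultdict(list)
    for vec in min_hash_vectors:
        buckets[vec[1 + i]].append(vec[0])  # vec[0] is the tuple ID
    pairs = set()
    for ids in buckets.values():
        if len(ids) > 1:
            pairs.update(combinations(sorted(ids), 2))  # C(|B|, 2) pairs per bucket
    return pairs

A bucket of five tuples thus contributes C(5, 2) = 10 candidate pairs and a bucket of three tuples contributes C(3, 2) = 3, consistent with the thirteen pairs in the example above.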
The number of tuple comparisons is proportional to the sum of squares of the frequencies of the distinct hash values. Only pairs of tuples that fall into the same bucket are compared, which significantly reduces the number of similarity function tuple comparisons. Besides the reduction in comparisons, sorting on min-hash coordinates results in natural clustering and avoids random accesses to the base relation. Candidate tuples may be identified such that the probability of missing any pair of tuples in the input relation whose similarity is above a specified threshold is bounded by a specified value. The probabilistic approach allows a reduction in the number of sorts of the min-hash vectors and the base relation, and in the number of candidate tuples compared. In particular, probabilistic fuzzy duplicate detection returns any tuple pair (u, v) whose similarity f(u, v) is greater than a threshold θ with probability at least 1−ε, wherein the error bound ε is the probability with which one may miss tuple pairs whose similarity is above θ. The number of hash vector coordinates h needed to identify candidate tuple pairs is determined by the error bound ε and the threshold θ as follows:
h=ln(ε)/ln(1−θ)
For example, with threshold θ=0.9 and ε=0.01, h=2 min-hash coordinates are required.
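In code, this bound is a one-liner; the function name below is illustrative:

import math

def coordinates_needed(theta, eps):
    # h = ln(eps) / ln(1 - theta), rounded up to the next whole coordinate
    return math.ceil(math.log(eps) / math.log(1.0 - theta))

coordinates_needed(0.9, 0.01) evaluates to 2, matching the example above.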
The choice of when to compare two tuples leads to several instances of probabilistic algorithms for detecting pairs of fuzzy duplicates. Referring now to
Hash vector coordinates are selected for each tuple such that the total number of selected tuple pairs to be compared is minimized. In particular, one or more hash coordinates (k of them) for a particular hash vector are selected as a function of the frequency of the hash values of the vector, at 520. More specifically, the frequencies of the hash values are determined for each coordinate of a particular hash vector. The k coordinates selected for the particular vector are those whose hash values have the smallest frequencies (e.g., the smallest buckets). It is appreciated that vector coordinates having frequencies of one are not selected, because they indicate that there is no potential duplicate tuple.
The tuples are compared based upon the selected vector coordinates. For each coordinate i of a particular hash vector, the hash vectors are sorted to group tuples together, at 530. At 540, a tuple whose ith coordinate is selected is compared with tuples that share the same hash value on that coordinate; this procedure identifies candidate tuples. The candidate tuples are compared utilizing a similarity function, at 550. The pairs of tuples that are more similar than a predetermined threshold are returned. In one implementation, the similarity function may be a Jaccard similarity function, some variant of the Jaccard similarity function, a cosine similarity function, an edit distance similarity function or the like.
Accordingly, the smallest bucket algorithm exploits the variance in the sizes of the buckets (e.g., lower frequencies for given coordinates) to which a tuple belongs, over each of its hash coordinates. The higher the variance, the higher the reduction in the number of tuple comparisons. However, the reduction in comparisons has to be traded off against the increased cost of materializing and sorting the additional min-hash coordinates.
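The following is a minimal sketch of the smallest buckets algorithm, assuming tuples have already been converted to min-hash vectors of the form (ID, mh1, . . . , mhh), that tokens_by_id maps each tuple ID to its token set, and that the Jaccard similarity is used; the helper names are illustrative.

from collections import Counter

def jaccard(a, b):
    # Jaccard similarity of two token sets
    return len(a & b) / len(a | b)

def smallest_buckets(vectors, tokens_by_id, k, theta):
    h = len(vectors[0]) - 1  # number of min-hash coordinates
    # Bucket sizes: frequency of each (coordinate, hash value) pair.
    freq = Counter((i, vec[1 + i]) for vec in vectors for i in range(h))
    buckets = {}   # (coordinate, hash value) -> tuple IDs
    selected = {}  # tuple ID -> the k coordinates chosen for it
    for vec in vectors:
        tid, coords = vec[0], vec[1:]
        for i in range(h):
            buckets.setdefault((i, coords[i]), []).append(tid)
        # Choose the k coordinates with the least frequent hash values,
        # skipping frequency-1 values (no potential duplicate there).
        usable = [i for i in range(h) if freq[(i, coords[i])] > 1]
        usable.sort(key=lambda i: freq[(i, coords[i])])
        selected[tid] = usable[:k]
    results, seen = [], set()
    for vec in vectors:
        tid, coords = vec[0], vec[1:]
        for i in selected[tid]:
            for other in buckets[(i, coords[i])]:
                pair = (tid, other) if tid < other else (other, tid)
                if other == tid or pair in seen:
                    continue
                seen.add(pair)
                if jaccard(tokens_by_id[pair[0]], tokens_by_id[pair[1]]) >= theta:
                    results.append(pair)
    return results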
The choice of parameters can significantly influence the running times of the various algorithms described above. In particular, let TB denote the time to build the min-hash relations. TB is linearly proportional to H, the total number of min-hash coordinates per tuple. Let TB=T1+H·CB for positive constants T1, denoting the initialization overhead, and CB, denoting the average cost of materializing each additional min-hash coordinate. Let TC denote the time to evaluate the similarity function over all candidate pairs. TC=NC·CC, where NC is the number of candidate pairs and CC is the average cost of evaluating the similarity function once. Let TQ denote the time to order the base relation. The cost here is equal to the number of times the relation is sorted times the average cost of sorting it once. (TQ can include, where necessary, the cost of joining with MinHash(R) and the temporary relation with the coordinate selection information.) Let TQ=T2+q·CQ, where q is the number of sorts required by the algorithm, for appropriate positive constants T2 and CQ. Here, we assume that the average sorting cost is independent of the number of sort columns.
Given the input data size and machine performance parameters, we can accurately estimate through test runs the constants CB, CQ and CC. The relevant parameters for the smallest bucket (SB) algorithm are h, the number of min-hash coordinates, and k, the number of min-hash coordinates selected per tuple. The cost of the SB algorithm is approximately equal to T1+T2+h·CB+h·CQ+NC·CC. One estimates NC given h and k and then chooses values for h and k which minimize the overall cost. This is feasible because if the Jaccard similarity of (u, v) is greater than or equal to θ, then (u, v) is output by the smallest buckets algorithm with probability at least 1 − Σ_{j=0..h−k} C(h, j)·θ^j·(1−θ)^(h−j). Accordingly, the value for h is constrained for a given k, and vice versa.
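This detection-probability bound translates directly into code; a minimal sketch, with an illustrative function name:

import math

def sb_detection_probability(h, k, theta):
    # 1 - sum_{j=0}^{h-k} C(h, j) * theta^j * (1 - theta)^(h - j): the chance
    # that a pair with similarity >= theta collides on more than h - k
    # coordinates, so at least one of any k selected coordinates collides.
    return 1.0 - sum(math.comb(h, j) * theta**j * (1.0 - theta)**(h - j)
                     for j in range(h - k + 1))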
For the SB algorithm, the number of candidate pairs generated for any tuple u is bounded by the sum of the sizes of the k smallest buckets selected for u. If one knows the distribution of the ith smallest bucket size over the min-hash coordinates, 1 ≤ i ≤ k, then one can estimate the total number NC of candidate pairs. Towards this goal, we can rely on standard results from order statistics. Given the density distribution f(x) and the cumulative distribution F(x) of bucket sizes for any min-hash coordinate, we can estimate the density distribution f(X[i]) for the ith smallest (of h total) bucket size as follows:
f(X[i]) = h·C(h−1, i−1)·F(x)^(i−1)·(1−F(x))^(h−i)·f(x)
Sampling-based methods may be used to estimate the distribution f(x). The expected number of candidate pairs from one tuple is bounded by Σ_{i=1..k} E[X[i]], and the expected number of total candidates is estimated as n·Σ_{i=1..k} E[X[i]], where n is the number of tuples in the database. Using the values of NC, CB, CQ and CC, we determine the values of h and k which minimize the overall cost.
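A simple sampling-based stand-in for this estimate is sketched below; it reuses the freq counter of bucket sizes from the smallest-buckets sketch above, and the function name is illustrative.

import random

def estimate_candidate_pairs(vectors, freq, k, sample_size=1000, seed=0):
    # For a random sample of tuples, sum the k smallest bucket sizes over
    # each tuple's coordinates (a bound on its candidate pairs), then
    # scale the sample mean by n, the total number of tuples.
    rng = random.Random(seed)
    n = len(vectors)
    sample = rng.sample(vectors, min(sample_size, n))
    total = 0
    for vec in sample:
        sizes = sorted(freq[(i, c)] for i, c in enumerate(vec[1:]))
        total += sum(sizes[:k])
    return n * total / len(sample)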
Referring now to
Hash vector coordinates are grouped such that the total number of candidate tuple pairs to be compared is reduced. In particular, the hash vectors are divided into groups of hash coordinates, at 620. The hash vectors are sorted based upon each group of vector coordinates, at 630. Hash vectors having the same hash values for all of the hash coordinates in the group will cluster together. At 640, candidate tuple pairs are determined from the clustered hash vectors. A tuple pair is a candidate if their hash values are equal for all the hash coordinates in the group. At 650, the candidate tuple pairs are compared utilizing a similarity function. The pairs of tuples that are more similar than a predetermined threshold are returned. In one implementation, the similarity function may be a Jaccard similarity function, some variant of the Jaccard similarity function, a cosine similarity function, an edit distance similarity function or the like.
The relevant parameters for the multi-grouping (MG) algorithm are g, the size of each group of min-hash coordinates, and f, the number of groups. One can write the total running time for the MG algorithm as T1+T2+f·g·CB+f·CQ+NC·CC. One can estimate NC in terms of f and g and choose them such that the overall cost is minimized. This is feasible because the value for f is constrained in terms of g, and vice versa. The values are constrained because the expected number of tuple comparisons performed by the MG algorithm is f·C(n, 2)·E[Jaccard(u, v)^g]. If θ is the similarity threshold, then (u, v) is output by the MG algorithm with probability at least 1−(1−θ^g)^f.
Accordingly, the expectation of the number of total candidate pairs is bounded by f·C(n, 2)·E[Jaccard(u, v)^g]. Using a random sample, we can estimate the expected value of the gth moment of the Jaccard similarity between pairs of tuples. We then choose values for g and f which minimize the overall running time.
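A minimal sketch of MG candidate generation, together with its detection-probability bound, follows; it assumes min-hash vectors of the form (ID, mh1, . . . , mh(f·g)) with groups of g consecutive coordinates, and the function names are illustrative.

from itertools import combinations

def multi_group_candidates(vectors, f, g):
    # A pair is a candidate if it agrees on every coordinate of at
    # least one of the f groups of g consecutive coordinates.
    pairs = set()
    for j in range(f):
        buckets = {}
        for vec in vectors:
            key = vec[1 + j * g: 1 + (j + 1) * g]  # collective hash value
            buckets.setdefault(key, []).append(vec[0])
        for ids in buckets.values():
            if len(ids) > 1:
                pairs.update(combinations(sorted(ids), 2))
    return pairs

def mg_detection_probability(f, g, theta):
    # 1 - (1 - theta^g)^f: the chance that a pair with similarity theta
    # agrees on all g coordinates of at least one of the f groups.
    return 1.0 - (1.0 - theta**g)**f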
Referring now to
Groups of hash vector coordinates are selected such that the total number of candidate tuple pairs to be compared is minimized. In particular, the hash vectors are divided into K groups of hash coordinates, at 720. The groups of hash coordinates may be different for different hash vectors. At 730, the frequencies of the collective hash values are determined for each possible group of hash coordinates. Based upon these frequencies, the groups which minimize the total number of candidate tuples are finalized. The hash vectors are sorted based upon the collective hash values for each group of vector coordinates, at 750. Hash vectors having the same hash values for all of the hash coordinates in the selected group of hash coordinates will cluster together. At 760, candidate tuple pairs are determined from the clustered hash vectors. A tuple pair is a candidate if their hash values are equal for all the hash coordinates in the group. At 770, the candidate tuple pairs are compared utilizing a similarity function. The pairs of tuples that are more similar than a predetermined threshold are returned. In one implementation, the similarity function may be a Jaccard similarity function, some variant of the Jaccard similarity function, a cosine similarity function, an edit distance similarity function or the like.
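The group-selection step can be sketched as scoring each candidate grouping by the number of candidate pairs it would induce; a minimal illustration, assuming the same vector format as above, with illustrative names:

from collections import Counter

def group_cost(vectors, group):
    # Candidate pairs induced by sorting on one group of coordinates:
    # the sum over buckets of C(size, 2).
    counts = Counter(tuple(vec[1 + i] for i in group) for vec in vectors)
    return sum(c * (c - 1) // 2 for c in counts.values())

def pick_groups(vectors, candidate_groups, num_groups):
    # Keep the groupings with the lowest collective-frequency cost,
    # i.e., those minimizing the total number of candidate tuples.
    return sorted(candidate_groups, key=lambda grp: group_cost(vectors, grp))[:num_groups]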
In a smallest bucket with dynamic grouping (SBDM) instantiation, one or more hash coordinates for a particular hash vector are selected as a function of the frequency of the hash values of the vector. In particular, the frequencies of the hash values are determined for each coordinate of a particular hash vector. The k coordinates selected for the particular vector are those whose hash values have the smallest frequencies (e.g., the smallest buckets). It is appreciated that vector coordinates having frequencies of one are not selected, because they indicate that there is no potential duplicate tuple. The vector coordinates not selected based upon smallest bucket size may then be dynamically grouped with one or more of the selected coordinates. The hash vectors are sorted based upon the collective hash values for each group of vector coordinates. Hash vectors having the same hash values for all of the hash coordinates in the selected group of hash coordinates will cluster together.
Generally, any of the processes for detecting duplicate tuples described above can be implemented using software, firmware, hardware, or any combination of these implementations. The term “logic,” “module” or “functionality” as used herein generally represents software, firmware, hardware, or any combination thereof. For instance, in the case of a software implementation, the term “logic,” “module,” or “functionality” represents computer-executable program code that performs specified tasks when executed on a computing device or devices. The program code can be stored in one or more computer-readable media (e.g., computer memory). It is also appreciated that the illustrated separation of logic, modules and functionality into distinct units may reflect an actual physical grouping and allocation of such software, firmware and/or hardware, or can correspond to a conceptual allocation of different tasks performed by a single software program, firmware routine or hardware unit. The illustrated logic, modules and functionality can be located in a single computing device, or can be distributed over a plurality of computing devices.
Although probabilistic techniques for detecting fuzzy duplicate tuples have been described in language specific to structural features and/or methods, it is to be understood that the subject of the appended claims is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as exemplary implementations of techniques for detecting fuzzy duplicates of tuples.