The present invention relates to computing functions to facilitate fast and secure lookups on large data sets.
Data matching is a key component in data integration and data quality. Data matching is often performed between two parties with data on common entities. The purpose of matching could be to perform checks or develop deeper insights about those entities. However, sometimes the data in question is sensitive and the parties don't want to share their datasets with each other or with a third party to do the match. For example, consider two companies, each with its own customer database. For a joint marketing campaign, the two companies want to find which individuals are customers of both companies. An easy method to find common customers is for the companies to exchange their databases with each other or to give them to a third party for the match. However, both companies are reluctant to share their customer databases with anyone due to concerns around data security and privacy. In some cases, especially if a company belongs to a regulated industry, privacy regulations prevent the companies from sharing customer data such as PII (Personally Identifiable Information) or PHI (Protected Health Information).
Perfect Hash Functions are used in computing to facilitate fast lookups on large data sets. Perfect Hash Functions are used in applications which require compact hash outputs without collisions. One such example is the Private Information Retrieval (PIR) Protocols, such as described in Privacy Preserving Queries over Relational Databases, F. Olumofin and I. Goldberg, Lecture Notes in Computer Science, Vol. 6205, 2010, pp 75-92. Although perfect hash functions are very efficient, they are usually not suitable for security applications which require a secure hash function. On the other hand, cryptographic hash functions are secure, but they don't have compact outputs and are not suitable for applications which require collision-free and compact hash outputs. Therefore, there is a need for perfect hash functions with compact outputs that have security properties similar to those of cryptographic hash functions. These secure perfect hash functions will help solve problems such as private matching, such as described in U.S. patent application Ser. No. 14/543,959, the contents of which are hereby incorporated by reference in their entirety, in a more efficient manner.
The present invention alleviates the problems described above by providing a secure perfect hash function that has properties similar to those of cryptographic hash functions without compromising features of a perfect hash function such as speed and collision-free outputs.
In accordance with embodiments of the present invention, a cryptographic hash function, such as, for example, SHA-2, is utilized to process the set S and the output is divided into three sub-outputs of required length. Each sub-output can now be treated as a separate hash function, thus giving three hash functions (g(x), f1(x), f2(x)). S is split into r buckets Bi, 0≦i<r, using the hash function g. Buckets Bi are permuted in a pseudorandom fashion. For each bucket Bi, a displacement pair (d0, d1) is chosen randomly from the sequence {(0,0), (0,1), . . . , (0, m−1), (1,0), (1,1), . . . , (1, m−1), . . . , (m−1, m−1)}, such that each element x of Bi is placed in an empty bin given by (f1(x) + d0·f2(x) + d1) mod m. If the displacement pair is not successful, another random pair is tried until a successful displacement is found. The index of this displacement in the sequence is stored. The secure perfect hash function then consists of the data structure that stores these m indexes. This hash function has properties similar to those of cryptographic hash functions, without compromising the attractive features of perfect hash functions such as speed and compact collision-free outputs.
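The construction summarized above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the function names (three_hashes, build, evaluate), the use of SHA-256 as the SHA-2 variant, the choice of 8-byte sub-outputs reduced modulo r or m, and the rng parameter are all hypothetical. For simplicity the sketch stores one index per bucket and shuffles all m·m candidate pairs for each bucket, which is practical only for small m.

```python
import hashlib
import random


def three_hashes(x: bytes, r: int, m: int):
    """Split one SHA-256 digest into three sub-outputs: g(x), f1(x), f2(x)."""
    digest = hashlib.sha256(x).digest()           # 32 bytes in total
    g = int.from_bytes(digest[0:8], "big") % r    # bucket number (assumed width)
    f1 = int.from_bytes(digest[8:16], "big") % m
    f2 = int.from_bytes(digest[16:24], "big") % m
    return g, f1, f2


def build(S, r, m, rng):
    """Return one displacement index per bucket, or raise if none fits."""
    buckets = [[] for _ in range(r)]
    for x in S:
        g, f1, f2 = three_hashes(x, r, m)
        buckets[g].append((f1, f2))

    order = list(range(r))
    rng.shuffle(order)                    # visit buckets in a permuted order
    occupied = [False] * m
    indexes = [0] * r
    for i in order:
        candidates = list(range(m * m))   # index k stands for pair (k // m, k % m)
        rng.shuffle(candidates)           # random trial order without repeats
        for k in candidates:
            d0, d1 = divmod(k, m)
            bins = [(f1 + d0 * f2 + d1) % m for f1, f2 in buckets[i]]
            # Success: bins are distinct within the bucket and all still empty.
            if len(set(bins)) == len(bins) and not any(occupied[b] for b in bins):
                for b in bins:
                    occupied[b] = True
                indexes[i] = k
                break
        else:
            raise RuntimeError("no displacement pair fits bucket %d" % i)
    return indexes


def evaluate(x: bytes, indexes, m: int) -> int:
    """Look up the bin of x using the stored displacement index of its bucket."""
    g, f1, f2 = three_hashes(x, len(indexes), m)
    d0, d1 = divmod(indexes[g], m)
    return (f1 + d0 * f2 + d1) % m
```

A caller would pass a random source such as random.SystemRandom() as rng; a seeded random.Random is convenient for experiments but is not a cryptographically secure generator. Evaluating a member of S costs one digest computation and one table lookup.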
Therefore, it should now be apparent that the invention substantially achieves all the above aspects and advantages. Additional aspects and advantages of the invention will be set forth in the description that follows, and in part will be obvious from the description, or may be learned by practice of the invention. Moreover, the aspects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings illustrate presently preferred embodiments of the invention, and together with the general description given above and the detailed description given below, serve to explain the principles of the invention. As shown throughout the drawings, like reference numerals designate like or corresponding parts.
In describing the present invention, reference is made to the drawings, wherein there is seen in
For a complete understanding of the present invention, it is necessary to understand the types of hash functions and the differences between them. A Perfect Hash Function for a set S is a hash function that maps each entry in that set to a set of integers without any collisions, i.e., no two members of the set S are mapped to the same integer by the perfect hash function. Once the perfect hash function for a set S has been created, the hash of any member of S can be evaluated in constant time. An example of a perfect hash function is the Compressed Hash-and-Displace (CHD) function as described in Hash, Displace and Compress, D. Belazzougui, F. C. Botelho and M. Dietzfelbinger, 17th Annual European Symposium, Copenhagen, Denmark, Sep. 7-9, 2009, pages 682-693. CHD (and many other perfect hash functions) is represented by a data structure which is computed from the input set S and is only valid for the input set S. If the input set changes to S′, a new CHD (and the associated data structure) will have to be calculated for S′.
A minimal perfect hash function is a perfect hash function that maps n elements of a set S to n consecutive integers. Minimal perfect hash functions are desirable because of their compact representation. All minimal perfect hash functions are also perfect hash functions.
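The two definitions can be checked mechanically for a small set. The following sketch is illustrative only; the helper names is_perfect and is_minimal_perfect and the sample tables are hypothetical, and a dictionary lookup stands in for a real hash function.

```python
def is_perfect(h, S) -> bool:
    """True if h maps no two members of S to the same integer."""
    outputs = [h(x) for x in S]
    return len(set(outputs)) == len(outputs)


def is_minimal_perfect(h, S) -> bool:
    """True if h maps the n members of S onto n consecutive integers."""
    outputs = sorted(h(x) for x in S)
    n = len(outputs)
    return n == 0 or outputs == list(range(outputs[0], outputs[0] + n))


S = ["ann", "bob", "cy"]
minimal = {"ann": 0, "bob": 1, "cy": 2}.get     # consecutive, no collisions
sparse = {"ann": 0, "bob": 5, "cy": 2}.get      # no collisions, but gaps
colliding = {"ann": 0, "bob": 0, "cy": 2}.get   # "ann" and "bob" collide
```

Here minimal satisfies both checks, sparse is perfect but not minimal, and colliding is neither.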
There are several notable differences between Cryptographic Hash Functions and CHD. They are as follows: (i) A CHD hash function is computed from a particular input set S and is only valid for that input set. A cryptographic hash function is not computed from an input set and is therefore valid for any input. (ii) A CHD hash function is represented by a data structure whose size is proportional to the size of the input set S. Cryptographic hash functions, on the other hand, have fixed compact representations which do not depend on the input. (iii) The output of a cryptographic hash function doesn't leak any information about its input. The output of the CHD hash function leaks some information about its input set S. (iv) The CHD representation, i.e., the associated data structure, leaks information about its input set S. Cryptographic hash function representations are not tied to a particular input set.
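The difference in output size can be made concrete with a short calculation. The sketch below is illustrative only; the helper names sha256_bits and perfect_hash_bits are hypothetical, and SHA-256 is used as one example of a cryptographic hash function. It compares the fixed digest length of SHA-256 with the number of bits needed to name one of m bins.

```python
import hashlib


def sha256_bits(x: bytes) -> int:
    """Output length of SHA-256 in bits; it does not depend on the input."""
    return len(hashlib.sha256(x).digest()) * 8


def perfect_hash_bits(m: int) -> int:
    """Bits needed to address one of m bins (at least one bit)."""
    return max(1, (m - 1).bit_length())
```

For a table of one million bins, a perfect hash output fits in 20 bits, whereas a SHA-256 digest always occupies 256 bits, whatever the input.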
Traditional Perfect Hash Functions have been designed to produce compact collision-free outputs to facilitate fast lookups on large data sets. They do their job well but are not suitable for security applications, which require hash functions that don't leak information about their input. Perfect hash functions have some desirable properties that can be used to design privacy and security protocols. For example, perfect hash functions like CHD are computed from a specific input set and are only valid for that input set. This property can be used to design efficient private matching protocols which allow two parties to find the intersection of their data sets without sharing their data sets with each other. However, for the private matching protocol to be secure, it is required that the perfect hash function must not leak information about its input set S.
A more detailed description of CHD will now be provided, along with its drawbacks and new enhancements according to the present invention to make it secure for security and privacy applications. The CHD function maps all elements of a set S to m bins such that no bin has more than one element and m≧|S| (|S| is the size of the set S). The function performs this mapping in two steps. In the first step, the CHD function uses a hash function to map elements of S to an intermediate table of size r (or r buckets), where r<|S|. In the second step, for each bucket, the CHD function uses independent random hash functions to map elements of that bucket to a table of size m (or m bins) such that no bin has more than one element, where m≧|S|. A more detailed description of the CHD function is as follows:
Referring now to
There are issues with the CHD function, however, such that the data structure of the CHD hash as well as the output of the CHD hash can leak significant information about the input. For example, consider two input sets S and S′ of the same size which differ in only one element. Then, according to the above function, the CHD hash functions computed for the two sets S and S′ will be similar, i.e., the data structures of the two CHD hash functions will be similar. To see why, first consider step 32 of
According to the present invention, modifications are made to the CHD function as illustrated in
Referring now to
While preferred embodiments of the invention have been described and illustrated above, it should be understood that these are exemplary of the invention and are not to be considered as limiting. Additions, deletions, substitutions, and other modifications can be made without departing from the spirit or scope of the present invention. Accordingly, the invention is not to be considered as limited by the foregoing description but is only limited by the scope of the appended claims.