Data security and encryption is a branch of computer science that relates to protecting information from disclosure to other systems and allowing only an intended system access to that information. The data may be encrypted using various techniques, such as public/private key cryptography and/or elliptic-curve cryptography, and may be decrypted by the intended recipient using a shared public key and a private key and/or other corresponding decryption technique. Transmission of the data is protected from being decrypted by other systems at least by their lack of possession of the encryption information.
For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
In various embodiments of the present disclosure, a first system (such as an encrypted data processing component) may receive, from a second system (such as a data requestor), a first encrypted data vector representing a text search query. The first system may receive, from a third system (such as a data provider), second encrypted data, the second encrypted data including a first vector and a second vector representing text of an electronic document. The first system may determine third encrypted data representing a first difference between the first encrypted data vector and the first vector of the second encrypted data, and fourth encrypted data representing a second difference between the first encrypted data vector and the second vector of the second encrypted data. The first system may then determine fifth encrypted data representing a first product of the third encrypted data and the fourth encrypted data. The first system may send the fifth encrypted data to the third system. The first system may receive, from the third system, first decrypted data representing a value of the fifth encrypted data. The first system may determine that the first decrypted data satisfies a condition and determine results data for the text search query based on the first decrypted data satisfying the condition. The first system may send the results data to the second system.
Many types of data are private or highly sensitive, such as financial and medical records. While this data must be controlled, and measures should be taken to protect data that includes sensitive information such as personally identifiable information (PII), opportunities may exist for learning from the data as a whole without exposing individual records or PII. A data custodian may be a controller of private data who permits access to the data under the right conditions or with sufficient data abstraction.
As data privacy is a major concern for data custodians, providing query or search capabilities over private datasets may be impossible or very difficult. The problem arises, in one example, because data custodians prefer not to make data fully accessible for operations such as a query or search, as this may expose PII or other sensitive information. Solutions may include a data custodian providing limited access to the data for clients that do not need full access, such as clients that wish to extract specific and relevant pieces of information for an application. Examples of such extraction applications may include applications used in the medical and pharmaceutical arenas, where patient records must remain private while medical research agencies are still allowed to access information about the frequency of medical conditions or side effects of drugs. Other examples may include private communication applications, such as private email and instant messaging, where the same user needs to search over data that needs to remain encrypted at rest.
An example scenario may concern accessing confidential medical records. A data custodian may manage a database of confidential medical records. The confidential medical records may include patient personal information, such as name, address, and age, as well as the conditions for which the patient received treatment, further described by symptoms, diagnosis information, and treatments including drugs, posology, treatment effects, and side effects. To ensure the data stays private, the medical records may not be stored in plain text. Additionally, a plain text version of a search index for medical records data may reveal private information through simple dictionary attacks.
A pharmaceutical company, or data requestor, may wish to obtain medical record data to conduct research about the side effects of one of its drugs. The data requestor does not have access to the medical records in plain text. However, the data requestor is trying to answer a simple question: does any patient receiving the particular drug present any symptoms in a known set of symptoms? Thus, the data requestor does not need any patient identifying information. Instead, the data may be limited to information about the medical experiences of patients taking the particular drug to satisfy the research needs.
The data custodian may encrypt the medical records database and may hold the encryption key pair. Additionally, the data requestor's research may be confidential, and the data requestor may not want to expose the data request or query to the data custodian. For example, if queries by the pharmaceutical company for a particular drug and medical conditions were made public, unfounded speculation may be made about the drug's side effects. Further, the data custodian may rely on a cloud-based search engine and delegate hosting the search operations to a third party. Thus, the third party may also provide non-disclosure of the data requestor's search terms.
Embodiments of the present disclosure thus relate to systems and methods for identifying matches to search terms from the data requestor in the encrypted database of the data custodian. This may include techniques in which the data custodian hosts the encrypted database and the data requestor accepts exposure of the search query to the data custodian. This may also include techniques in which the third party, or encrypted data processing component, holds the encrypted database or encrypted search index(es) of the database and receives encrypted search queries from the data requestors, thus limiting exposure between the data custodian and the data requestor.
The methods and techniques described herein may use a word embedding to build a representation of the database, or search corpus, which is then encrypted using a homomorphic encryption system capable of addition and multiplication, such as a Fully Homomorphic Encryption (FHE) scheme. The encrypted word embedding representation of the search corpus may serve as a search index for performing search operations. This may provide corpus data privacy, as the search corpus is encrypted at all times, and query privacy, as the query terms are never revealed to the data owner. Additionally, benefits may include delegation of the search operation to a trusted third party that may execute in the cloud, and an execution time that is not cost-prohibitive. The techniques described herein may offer better search performance than other encrypted searches because they utilize a binary search tree, and thus the search operation cost is logarithmic.
In some embodiments, the data provider component, encrypted data processing component, and/or data requestor and/or query component encrypt and/or decrypt data in accordance with an encryption technique, such as Rivest-Shamir-Adleman (RSA) encryption, elliptic-curve encryption, or any encryption that is homomorphic (partially or fully); in these embodiments, the components may transmit, to the other components, only encrypted data.
The data provider 124 may control access to data, such as data stored in database(s) 126. The data in the database(s) 126 may be encrypted. The data provider 124 may provide searchable information about the data in the database(s) 126 to the encrypted data processing component 122, such as encrypted search indexes as described in reference to
Each document in the databases of the data provider 124 may be transformed into a bag-of-words representation. The bag-of-words model represents the text of a document as the list of the words it includes; it disregards grammar and word order but may record the multiplicity of words. Such a bag-of-words representation may then be used to identify words in particular documents for searching purposes.
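The bag-of-words representation can be illustrated with a minimal sketch. The tokenization rule (lowercasing and splitting on non-alphanumeric characters) and the function name `bag_of_words` are assumptions made for illustration; the disclosure does not specify them.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Return the words of `text` with their multiplicities.

    Grammar and word order are discarded; only the counts remain.
    The tokenizer below is an assumed, simplistic one.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)


# Hypothetical document text, for illustration only.
bag = bag_of_words("Patient reports nausea. Nausea began after the first dose.")
```

The keys of `bag` are the distinct words of the document, which is the set over which an embedding transformation would then be applied.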
As shown in
The set of vectors in the embedding space 210 is a result of the function of equation (2), where $H$ is the word embedding transformation and $\mathrm{Bag}_d$ is the bag-of-words representation of document $d$.
$$\vec{v} = H(s) \quad \text{for } s \in \mathrm{Bag}_d \qquad (2)$$
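Equation (2) can be sketched as follows. The particular choice of $H$ here, a vector of Levenshtein edit distances from the word to a fixed list of pivot words, is an assumption made only for illustration; the names `PIVOTS`, `levenshtein`, and `H` are hypothetical, and the disclosure does not define the embedding this way.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical pivot words that define the axes of the embedding space.
PIVOTS = ("fever", "nausea", "rash")


def H(word: str) -> tuple:
    """Assumed embedding: distances from `word` to each pivot word."""
    return tuple(levenshtein(word, p) for p in PIVOTS)


# v = H(s) for every s in Bag_d, as in equation (2).
bag_d = {"nausea", "dose"}
vectors = {s: H(s) for s in bag_d}
```

Every word maps to a vector of the same dimension, so the document and query vectors remain comparable component by component.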
Through the bag-of-words representation and the embedding transformation 206, such as using the Levenshtein edit distance, each document in the search corpus, such as the text document 202, is transformed into a set of vectors in the embedding space 210. A homomorphic encryption 212 is performed to encrypt the set of vectors in the embedding space 210. The homomorphic encryption 212 may use a public key 214 provided by the data owner, or data provider 124. A set of encrypted vectors 216 may result from performing the homomorphic encryption 212 on the set of vectors in the embedding space 210. Through the use of homomorphic encryption, homomorphic operations may be used that are part of the search operations described in reference to
An encryption function may be defined as $\varepsilon(K_{public}, K_{private})$, where $K_{public}$ and $K_{private}$ may be the public key 214 and the corresponding private key, owned by the data provider. Thus, the encryption function for encrypting the set of vectors in the embedding space 210 into the set of encrypted vectors 216 may be expressed as equation (3).
The set of encrypted vectors 216 may be stored by the data provider 124, such as in the database(s) 126. As previously noted, the embedding and encryption process described in relation to
Using the process described in relation to
Searching for matches of a search term in the set of encrypted vectors 216 may be equivalent to finding elements of the set that are equal to the encrypted search vector, which results from applying to the search term the same transformation and encryption that were applied to generate the set of encrypted vectors. This is essentially identifying matches between the encrypted search vector and the encrypted vectors of the search index. Using the symbol to denote the encrypted comparison, then searching for a term $t$ is the equivalent of finding $\vec{ev}$, such as in equation (4).
$$\lVert \vec{et}, \vec{ev} \rVert\, 0 \qquad (4)$$
As previously noted, the data requestor 120 may request a search directly from the data provider 124 if the data requestor 120 accepts that the search query is exposed to the data provider 124. The data provider 124 may receive the search terms from the data requestor 120. The data provider 124 may transform and encrypt the search terms and then perform the comparison operations between the encrypted search vector and the stored encrypted vectors corresponding to the text documents. While this does expose the search terms of the data requestor, a concession the data requestor must agree to, the data of the data provider 124 is secure because the data at rest stays encrypted. The advantage of such a scenario is that the data provider does not have to decrypt the data, thus lowering the risk of data leaks.
In another embodiment, the search may be delegated to a third party, or encrypted data processing component 122. The data requestor 120 may prefer to keep their search terms private, such as the example of a pharmaceutical company that does not want to expose possible side effects of a drug. In this embodiment, the data provider 124 may delegate the search operation to a third party, such as the encrypted data processing component 122.
As shown in
The encrypted data processing component 122 may broadcast (308) the public key to clients, such as the data requestor 120, that may want to query the search index of the data provider 124. The encrypted data processing component 122 may generate (310) a random vector $\vec{r}$ to use as a salt that obscures both the search index and the search query. The random vector $\vec{r}$ may be the same dimension as the encrypted corpus vectors of the search index. The encrypted data processing component 122 may generate (312) a salted encrypted search index for the encrypted search index by computing the inner product of the random vector with the encrypted vectors, as represented in equation (5).
$$\vec{r} \cdot \vec{e_j} \qquad (5)$$
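The salting step of equation (5) can be sketched on plain integers. This is only an arithmetic illustration: in the described system the vectors $\vec{e_j}$ are ciphertexts and the inner product would be evaluated homomorphically, which this sketch does not attempt. The names `inner`, `random_salt`, and the sample vectors are hypothetical.

```python
import secrets


def inner(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def random_salt(dim: int, bound: int = 2**16):
    """Random vector r with nonzero components in [1, bound - 1]."""
    return [secrets.randbelow(bound - 1) + 1 for _ in range(dim)]


# Plain integers standing in for the encrypted index vectors e_j.
index_vectors = [[3, 0, 5], [1, 4, 2], [6, 6, 0]]

r = random_salt(3)  # same dimension as the index vectors
salted_index = [inner(r, e) for e in index_vectors]  # r . e_j, equation (5)
```

Each index vector collapses to a single salted value, so the salted index has one entry per vector.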
The data requestor 120 may identify a set of search terms for searching the corpus of the data provider 124. Using the same embedding transformation and the received public key, the data requestor may generate (314) an encrypted search query $\vec{q}$ of the same dimension as the corpus, using the identified set of search terms. The data requestor 120 may send (316) the encrypted search query to the encrypted data processing component 122 for executing the search.
As shown in
$$\delta_j = \vec{r} \cdot \vec{e_j} - \vec{r} \cdot \vec{q} \qquad (6)$$
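A worked instance of equation (6) on plain integers may help. The values of `r`, `index_vectors`, and `q` are made up for illustration, and plain integers stand in for ciphertexts; a real deployment would evaluate the same arithmetic homomorphically.

```python
def inner(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def deltas(r, index_vectors, q):
    """delta_j = r . e_j - r . q for every index vector e_j (equation (6))."""
    rq = inner(r, q)  # the salted query is computed once and reused
    return [inner(r, e) - rq for e in index_vectors]


r = [2, 7, 3]                                       # salt vector (hypothetical)
index_vectors = [[3, 0, 5], [1, 4, 2], [6, 6, 0]]   # stand-ins for e_1..e_3
q = [1, 4, 2]                                       # query equal to e_2

d = deltas(r, index_vectors, q)
# r.e_1 = 21, r.e_2 = 36, r.e_3 = 54, r.q = 36, so d == [-15, 0, 18]
```

Only the vector that equals the query produces a zero delta in this example.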
Calculating the difference for each vector of the set of encrypted vectors may result in a set of deltas $\delta_1, \delta_2, \ldots, \delta_n$. The encrypted data processing component 122 may generate (322) a binary tree from the set of deltas. The encrypted data processing component 122 may generate the binary tree by starting with the deltas $\delta_1, \delta_2, \ldots, \delta_n$ as the leaf nodes of the binary tree and recursively multiplying pairs of deltas to generate the parent branch nodes of the leaf nodes. For example, the parent branch nodes of the leaf nodes of the binary tree may comprise the product pairs $\delta_1 \cdot \delta_2, \delta_3 \cdot \delta_4, \ldots, \delta_{n-1} \cdot \delta_n$. The next level above these branch nodes may then be calculated as the product of the previous lower branch nodes, such as $(\delta_1 \cdot \delta_2) \cdot (\delta_3 \cdot \delta_4), \ldots, (\delta_{n-3} \cdot \delta_{n-2}) \cdot (\delta_{n-1} \cdot \delta_n)$. The recursion continues until the root node, whose value is $\delta_1 \cdot \delta_2 \cdots \delta_n$.
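The tree construction can be sketched as a list of levels, from the leaves up to the root. Plain integers stand in for the encrypted deltas, and the handling of an odd number of nodes (the unpaired node is carried up unchanged) is an assumption, since the description only shows the paired case. The function name `build_product_tree` is hypothetical.

```python
def build_product_tree(leaves):
    """Return the tree as levels: levels[0] are the leaves, levels[-1] is [root].

    Each parent is the product of its two children. An unpaired last
    node is carried up unchanged (assumed behavior for odd counts).
    """
    if not leaves:
        raise ValueError("at least one delta is required")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        parents = []
        for i in range(0, len(prev), 2):
            pair = prev[i:i + 2]
            parents.append(pair[0] * pair[1] if len(pair) == 2 else pair[0])
        levels.append(parents)
    return levels


tree = build_product_tree([-15, 0, 18, 4])
# levels: [[-15, 0, 18, 4], [0, 72], [0]]
```

A zero leaf forces every ancestor up to the root to zero, which is the property the later traversal relies on.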
Upon generation of the binary tree, the encrypted data processing component 122 sends (324) the binary tree to the data provider 124. Using the private key, the data provider 124 may decrypt (326) the nodes of the binary tree. While this may allow the data provider 124 to identify which nodes indicate a match of the search terms and the search index, because of the multiplication with the salt, or random vector $\vec{r}$, the data provider 124 is not able to translate the values of the matched nodes back to the search index. Thus, the search query is kept private from the data provider 124. The data provider 124 may send (328) the decrypted binary tree to the encrypted data processing component 122. The communication between the encrypted data processing component 122 and the data provider may use an encrypted data channel, such as to prevent exposure of the decrypted binary tree.
As shown in
From the traversal of the decrypted binary tree, the encrypted data processing component 122 may identify leaf nodes with a value of zero. If the root node of the decrypted binary tree does not have a value of zero, then no matches may exist between the search query terms and the search index terms.
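The traversal can be sketched as a top-down walk that prunes every subtree whose node is nonzero, since a nonzero product cannot contain a zero leaf. The level-list layout (leaves first, root last) and the function name `zero_leaves` are assumptions made for this illustration.

```python
def zero_leaves(levels):
    """Return the indices of zero-valued leaves in a decrypted product tree.

    `levels[0]` holds the leaves and `levels[-1]` holds the single root.
    Children of node i at one level are nodes 2i and 2i+1 one level down.
    """
    matches = []
    stack = [(len(levels) - 1, 0)]  # start at the root
    while stack:
        level, i = stack.pop()
        if levels[level][i] != 0:
            continue  # nonzero product: no zero leaf below, prune
        if level == 0:
            matches.append(i)
            continue
        below = levels[level - 1]
        for child in (2 * i, 2 * i + 1):
            if child < len(below):
                stack.append((level - 1, child))
    return sorted(matches)


# Hypothetical decrypted tree with eight leaves.
decrypted = [
    [-15, 0, 18, 4, 0, 9, 1, 2],
    [0, 72, 0, 2],
    [0, 0],
    [0],
]
found = zero_leaves(decrypted)  # leaves 1 and 4 are zero
```

When the root is nonzero the walk ends immediately, which corresponds to the no-match case described above.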
In some instances, identifying a node with a zero value may not be sufficient to establish that $\vec{e_j} - \vec{q} = 0$ for a vector of the set of encrypted vectors $\vec{e_j}$, $j = 1 \ldots m$, and thus may not be sufficient to identify a true term match, because the inner product of $\vec{r}$ with a nonzero difference can also be zero. The computation of $\delta_j = \vec{r} \cdot \vec{e_j} - \vec{r} \cdot \vec{q}$ using the random vector $\vec{r}$ may increase the likelihood of identifying a true match. Because both the encrypted search query $\vec{q}$ and each vector of the set of encrypted vectors $\vec{e_j}$, $j = 1 \ldots m$, are multiplied by the random vector $\vec{r}$, an inference about either may not be made if the decrypted $\delta_j$ is discovered. However, in some embodiments, to increase certainty of the match, many linearly independent vectors $\vec{r}$ may be used to form a matrix $R$. The matrix $R$ may have rank between 1 and the number of components in the vectors $\vec{e_j}$ and $\vec{q}$. For example, using a matrix $R$ of full rank, if the calculation of $R \cdot (\vec{e_j} - \vec{q})$ is equal to zero, then the match is confirmed. This is because a full-rank $R$ is invertible, so $R \cdot (\vec{e_j} - \vec{q}) = 0$ implies $\vec{e_j} - \vec{q} = R^{-1} 0 = 0$; that is, there is a true match only when $\vec{e_j} - \vec{q}$ is the zero vector. In some embodiments, determining that the $\delta_j$, or difference value, is equal to zero may identify a potential match. In some embodiments, multiplying the difference $\vec{e_j} - \vec{q}$ by the invertible matrix $R$ may confirm the match.
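A small numeric sketch shows why a single salt vector can yield a false positive and how a full-rank matrix rules it out. All values are made up, plain integers stand in for ciphertexts, and `confirm_match` is a hypothetical helper name.

```python
def inner(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(R, x):
    """Matrix-vector product, with R given as a list of rows."""
    return [inner(row, x) for row in R]


def confirm_match(R, e, q):
    """True only if R . (e - q) is the zero vector."""
    diff = [a - b for a, b in zip(e, q)]
    return all(c == 0 for c in matvec(R, diff))


e = [5, 1, 0]
q = [1, 3, 0]          # differs from e, so this is not a true match
r = [1, 2, 9]          # single salt vector

diff = [a - b for a, b in zip(e, q)]   # [4, -2, 0]
false_positive = inner(r, diff) == 0   # 4 - 4 + 0 == 0, so delta is zero

# Upper-triangular matrix with a unit diagonal, hence full rank; its
# first row is the salt vector r above.
R = [[1, 2, 9],
     [0, 1, 4],
     [0, 0, 1]]

rejected = not confirm_match(R, e, q)  # R . diff == [0, -2, 0]
accepted = confirm_match(R, q, q)      # identical vectors always pass
```

The first row of `R` reproduces the zero delta, while the second row exposes the nonzero difference.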
In some embodiments, the encrypted data processing component 122 may traverse the decrypted binary tree to identify the leaf nodes with a value of zero, or the zero deltas. Each of these identified leaf nodes may be tested using the R matrix to confirm the term match. The R matrix may be used when there are multiple instances that a delta is equal to zero to determine a true match.
The encrypted data processing component 122 may translate (332) the identified matches from the decrypted binary tree and, in some embodiments, may confirm the match using the R matrix. The encrypted data processing component 122 may send (334) the results to the data requestor 120, such as whether a true match has been found or not. In some embodiments, data identifying where the match has been found may be transmitted as part of the results data to the data requestor 120. For example, the encrypted data processing component 122 may identify the number of documents in the corpus which had terms that matched the search query terms.
As described in operation 324, the encrypted data processing component 122 may send the binary tree 400 of encrypted values to the data provider 124. As described in operation 326, the data provider 124 may decrypt, such as using the private key that corresponds to the public key provided by the data provider 124, the vertices of the encrypted binary tree 400. The decryption of the vertices of the encrypted binary tree 400 may result in values 416, 418, 420, 422, 424, 426, 428. The data provider 124 may construct a binary tree with the decrypted values 416, 418, 420, 422, 424, 426, 428 that correspond to the encrypted values of nodes 402, 404, 406, 408, 410, 412, 414. The data provider 124 may send the decrypted binary tree to the encrypted data processing component 122 as described in operation 328.
As described in operation 330, the encrypted data processing component 122 may traverse the binary tree 400 with the decrypted values. The encrypted data processing component 122 may traverse the binary tree 400 to identify nodes with a decrypted value of zero. The traversal may ultimately result in identifying the leaf nodes of the binary tree 400 with a decrypted value of zero, and thus a positive match for the search term. For example, as shown in
The leaf nodes of the binary tree 400 that have a decrypted value of zero, if any are present, may indicate the matching terms of the search index. As described in operations 332 and 334, the encrypted data processing component 122 may identify the matching terms from the zero value leaf nodes and send the results to the data requestor 120.
In some embodiments, building the binary tree may not be needed to identify an initial match. However, the binary tree serves as a means to optimize the search for leaf nodes with a value of zero, or zero deltas. In some embodiments, caching may be used to optimize the multiplications. In some embodiments, the delta products may be broken down into parts that may be precomputed when the search index is built. For example, equation (7) may represent a delta product.
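$$\delta_1\delta_2 = \sum\left(r_{i1}e_{i1} - r_{i1}q_i\right)\sum\left(r_{i2}e_{i2} - r_{i2}q_i\right) \qquad (7)$$
Equation (8) may represent further deconstruction of equation (7).
$$\delta_1\delta_2 = \left(\sum r_{i1}e_{i1} - \sum r_{i1}q_i\right)\left(\sum r_{i2}e_{i2} - \sum r_{i2}q_i\right) \qquad (8)$$
Equation (9) may represent the final deconstruction of equations (7) and (8), such that portions of equation (9) may be isolated for precomputing.
$$\delta_1\delta_2 = \left(\sum r_{i1}e_{i1}\sum r_{i2}e_{i2}\right) - \left(\sum r_{i1}e_{i1}\sum r_{i2}q_i\right) - \left(\sum r_{i1}q_i\sum r_{i2}e_{i2}\right) + \left(\sum r_{i1}q_i\sum r_{i2}q_i\right) \qquad (9)$$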
Some parts of equation (9) may be precomputed, such as $\left(\sum r_{i1}e_{i1}\sum r_{i2}e_{i2}\right)$, $\sum r_{i1}e_{i1}$, and $\sum r_{i2}e_{i2}$. Precomputing these parts may further reduce the number of operations and computations. These operands may be at the first non-leaf node level, or parent branch nodes of the leaf nodes, of the binary tree. For the branch levels above, the same decomposition and precomputations may be performed. Thus, the resulting precomputed operands for each level of the binary tree may be stored in a similar binary tree structure for faster indexing.
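The decomposition in equation (9) can be checked numerically. In this sketch the index-side sums are computed once when the index is built, and only the query-side sums are computed per search. The vectors are made-up plain integers, `r1` and `r2` follow the subscripts of the equations (and may be the same salt vector), and the helper names are hypothetical.

```python
def inner(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def precompute(r1, e1, r2, e2):
    """Index-time parts of equation (9): sum(r1*e1), sum(r2*e2), and their product."""
    a1, a2 = inner(r1, e1), inner(r2, e2)
    return a1, a2, a1 * a2


def delta_product(pre, r1, r2, q):
    """Query-time evaluation of equation (9) from the precomputed parts."""
    a1, a2, a1a2 = pre
    b1, b2 = inner(r1, q), inner(r2, q)
    return a1a2 - a1 * b2 - b1 * a2 + b1 * b2


r1 = r2 = [2, 7, 3]
e1, e2 = [3, 0, 5], [6, 6, 0]
q = [1, 4, 2]

pre = precompute(r1, e1, r2, e2)            # done once, at index build time
expanded = delta_product(pre, r1, r2, q)    # done per query

# Direct evaluation of equation (8) for comparison: (21 - 36) * (54 - 36).
direct = (inner(r1, e1) - inner(r1, q)) * (inner(r2, e2) - inner(r2, q))
```

Both evaluations give the same value, so the precomputed form changes only where the work is done, not the result.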
The first system (e.g., the encrypted data processing component 122) may receive (532), from the third system (e.g., the data provider 124), second encrypted data, the second encrypted data including a first vector and a second vector representing text of an electronic document. The second encrypted data may be encrypted by the third system using the public-key data. The vectors representing the text of the electronic document may be determined using an embedding transformation applied to the text of the electronic document, which is further detailed with the description in reference to
In some embodiments, the first system (e.g., the encrypted data processing component 122) may generate a random vector, or a salt. Upon receiving the first encrypted data and/or the second encrypted data, the first system may multiply the first encrypted data vector and the random vector to determine a salted first encrypted data vector. For example, the first system may use equation (5) to determine the salted first encrypted data vector from the random vector and the first encrypted data vector.
Further, the first system may multiply the first vector of the second encrypted data and the random vector, as well as multiply the second vector of the second encrypted data and the random vector, to determine salted second encrypted data. For example, the first system may use equation (5) to determine the salted second encrypted data vector by multiplying each vector of the second encrypted data vector and the random vector. In the following operations described in reference to
In some embodiments, the first system (e.g., the encrypted data processing component 122) may determine (534) third encrypted data representing a first difference between the first encrypted data vector and the first vector of the second encrypted data. In some embodiments, the first system may determine (534) fourth encrypted data representing a second difference between the first encrypted data vector and the second vector of the second encrypted data. For example, the first system may use equation (6) to determine the first and second difference.
In response to determining the third and fourth encrypted data, the first system may determine (536) fifth encrypted data representing a first product of the third encrypted data and the fourth encrypted data. In some embodiments, the fifth encrypted data may include a binary tree. The binary tree may comprise at least the third encrypted data, the fourth encrypted data, and the first product. For example, the first product may be the root node of the binary tree with the third encrypted data and the fourth encrypted data as the leaf nodes of the root node. As described in reference to
In some embodiments, the first system (e.g., the encrypted data processing component 122) may send (538) the fifth encrypted data to the third system (e.g., the data provider 124). The third system may determine first decrypted data by decrypting the fifth encrypted data, such as with the private-key data corresponding to the public-key data. In some embodiments, the first decrypted data may represent values corresponding to vertices of the binary tree. As described in reference to
In some embodiments, the first system (e.g., the encrypted data processing component 122) may receive (540) the first decrypted data representing a value of the fifth encrypted data from the third system (e.g., the data provider 124). The first system may determine (542) whether the value represented in the first decrypted data satisfies a condition. For example, the first system may determine whether the value is zero. The first system may determine (544) results data based on the determination of the value satisfying the condition. For example, if the value does not satisfy the condition, then the search query did not match any terms of the electronic document, and the results data may indicate that there were no matches. Conversely, if the value does satisfy the condition, the first system may determine results data indicating a positive match of the search query and at least one term of the electronic document. The first system may send (546) the results data to the second system (e.g., the data requestor 120).
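The flow of operations 534 through 546 can be sketched end to end with an insecure placeholder in place of real homomorphic encryption. The `Ciphertext` class below performs no encryption at all; it only mimics the add, subtract, multiply, and scale interface that a scheme supporting both addition and multiplication would offer. Every name here is hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Ciphertext:
    """Insecure stand-in for an FHE ciphertext (interface only, no security)."""
    _v: int

    def __add__(self, other):
        return Ciphertext(self._v + other._v)

    def __sub__(self, other):
        return Ciphertext(self._v - other._v)

    def __mul__(self, other):
        return Ciphertext(self._v * other._v)

    def scale(self, k: int):
        """Multiply by a plaintext integer."""
        return Ciphertext(self._v * k)


def encrypt_vector(vec):
    """Placeholder for encryption under the provider's public key."""
    return [Ciphertext(x) for x in vec]


def provider_decrypt(ct: Ciphertext) -> int:
    """Placeholder for decryption by the third system (private-key holder)."""
    return ct._v


def salt(r, enc_vec):
    """Inner product of plaintext salt r with an encrypted vector."""
    return reduce(lambda a, b: a + b, (c.scale(k) for k, c in zip(r, enc_vec)))


def first_system_search(r, enc_query, enc_doc_vectors, decrypt_oracle):
    """Differences (534), product (536), round trip (538/540), check (542/544)."""
    salted_query = salt(r, enc_query)
    third = salt(r, enc_doc_vectors[0]) - salted_query
    fourth = salt(r, enc_doc_vectors[1]) - salted_query
    fifth = third * fourth
    value = decrypt_oracle(fifth)   # sent to and returned by the third system
    return {"match": value == 0}    # results data for the second system


r = [2, 7, 3]
doc = [encrypt_vector([3, 0, 5]), encrypt_vector([1, 4, 2])]
hit = first_system_search(r, encrypt_vector([1, 4, 2]), doc, provider_decrypt)
miss = first_system_search(r, encrypt_vector([9, 9, 9]), doc, provider_decrypt)
```

Passing the decryption step in as a function keeps the roles separate: the first system never reads a ciphertext's value except through the third system.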
A variety of components may be connected through the input/output device interfaces 602. For example, the input/output device interfaces 602 may be used to connect to the network 170. Further components include keyboards, mice, displays, touchscreens, microphones, speakers, and any other type of user input/output device. The components may further include USB drives, removable hard drives, or any other type of removable storage.
The controllers/processors 604 may process data and computer-readable instructions and may include a general-purpose central-processing unit, a specific-purpose processor such as a graphics processor, a digital-signal processor, an application-specific integrated circuit, a microcontroller, or any other type of controller or processor. The memory 608 may include volatile random access memory (RAM), non-volatile read only memory (ROM), non-volatile magnetoresistive memory (MRAM), and/or other types of memory. The storage 606 may be used for storing data and controller/processor-executable instructions on one or more non-volatile storage types, such as magnetic storage, optical storage, solid-state storage, etc.
Computer instructions for operating the server 600 and its various components may be executed by the controller(s)/processor(s) 604 using the memory 608 as temporary “working” storage at runtime. The computer instructions may be stored in a non-transitory manner in the memory 608, storage 606, and/or an external device(s). Alternatively, some or all of the executable instructions may be embedded in hardware or firmware on the respective device in addition to or instead of software.
The above aspects of the present disclosure are meant to be illustrative. They were chosen to explain the principles and application of the disclosure and are not intended to be exhaustive or to limit the disclosure. Many modifications and variations of the disclosed aspects may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers and data processing should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of the present disclosure. Moreover, it should be apparent to one skilled in the art that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk, and/or other media. In addition, components of one or more of the modules and engines may be implemented in firmware or hardware, which may comprise, among other things, analog and/or digital filters (e.g., filters configured as firmware to a digital signal processor (DSP)).
Conditional language used herein, such as, among others, “can,” “could,” “might,” “may,” “e.g.,” and the like, unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without other input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment. The terms “comprising,” “including,” “having,” and the like are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and so forth. Also, the term “or” is used in its inclusive sense (and not in its exclusive sense) so that when used, for example, to connect a list of elements, the term “or” means one, some, or all of the elements in the list.
Disjunctive language such as the phrase “at least one of X, Y, Z,” unless specifically stated otherwise, is otherwise understood with the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present. As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.
This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/192,811, filed May 25, 2021, and entitled “Private Search,” in the name of Madjid Aoudia et al. The above provisional application is herein incorporated by reference in its entirety.
Number | Date | Country
---|---|---
63192811 | May 2021 | US