The present application relates generally to an improved data processing apparatus and method and more specifically to an improved computing tool and improved computing tool operations/functionality for providing a copy-and-recurse operation to perform fully homomorphic encrypted database query processing.
Fully homomorphic encryption (FHE) is an encryption scheme that enables analytical functions to be run directly on encrypted data while yielding results from the encrypted data that are the same as if the analytical functions were executed on the unencrypted data, also referred to as the plaintext. Such encryption schemes are attractive in cloud-based computing environments because they allow data providers to encrypt their data, and thereby maintain the privacy or secrecy of the data, before providing the encrypted data to cloud services that execute analytical functions on the encrypted data, train machine learning computer models using the encrypted data as training and testing datasets, execute machine learning computer models on the encrypted data, or the like, and generate results that are returned to the data providers. This allows data providers to leverage the computational capabilities and services of cloud-based computing environments without exposing their private data to other parties. FHE is likewise attractive to database providers, who can encrypt the data in their databases yet still respond to queries, such as queries for statistical information, without exposing the sensitive data.
For example, a data provider, e.g., a hospital, medical insurance company, financial institution, government agency, or the like, may maintain a database of data comprising private data about patients that the data provider does not want exposed outside of its own computing environment. However, the data provider, for various reasons, wishes to utilize the analytical capabilities, machine learning computer models, or the like, of one or more cloud-based computing systems to perform analytical functions, artificial intelligence operations, such as generating insights from classifications/predictions performed by trained machine learning computer models, or the like, on the private data. For example, if the data provider is a hospital and wishes to perform analytics on its patient data, the hospital would like to send the patient data to the cloud-based computing systems for performance of these analytics, which may use specially trained machine learning algorithms and the like. However, the hospital does not want to expose the personally identifiable information (PII) of the patients, e.g., names, addresses, social security numbers, or other types of information that alone or in combination can uniquely identify an individual, as such exposure would not only open the hospital to legal liability, but may also be in violation of established laws of the jurisdiction(s) in which the hospital operates. As a result, using FHE, the hospital may encrypt the data prior to sending the encrypted data to the cloud-based computing system for performance of the analytics functions. The analytics are executed on the encrypted data and the encrypted results are returned. The data provider then decrypts the encrypted results and obtains the plaintext results for use by the hospital. At no time in this process does the cloud-based computing system gain access to the unencrypted data and thus, privacy is preserved.
Similarly, the data provider, e.g., the hospital, may wish to allow others to query their encrypted database to gather statistical information without exposing the underlying private or sensitive data. Thus, the hospital may use FHE to perform operations on the encrypted data of the database and generate results without the operations accessing the private or sensitive data in plaintext (unencrypted).
This Summary is provided to introduce a selection of concepts in a simplified form that are further described herein in the Detailed Description. This Summary is not intended to identify key factors or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In one illustrative embodiment, a method, in a data processing system, is provided for performing a fully homomorphic encryption operation. The method comprises generating, for a data set in a backend data store, a tree data structure comprising a hierarchy of nodes and edges connecting the nodes in a parent-child relationship. The method further comprises, in response to receiving an encrypted query from a client computing device, executing a search operation using the tree data structure at least by executing a copy-and-recurse computing tool to identify a portion of the tree data structure to which to apply a fully homomorphic encryption (FHE) operation. The copy-and-recurse computing tool copies a subset of nodes of the tree data structure and recurses the search operation into the copied subset of nodes. In addition, the method comprises executing the FHE operation on a portion of the data set, corresponding to the identified portion of the tree data structure, to generate results of the FHE operation. Moreover, the method comprises outputting the results as an encrypted output to the client computing device.
In other illustrative embodiments, a computer program product comprising a computer useable or readable medium having a computer readable program is provided. The computer readable program, when executed on a computing device, causes the computing device to perform various ones of, and combinations of, the operations outlined above with regard to the method illustrative embodiment.
In yet another illustrative embodiment, a system/apparatus is provided. The system/apparatus may comprise one or more processors and a memory coupled to the one or more processors. The memory may comprise instructions which, when executed by the one or more processors, cause the one or more processors to perform various ones of, and combinations of, the operations outlined above with regard to the method illustrative embodiment.
These and other features and advantages of the present invention will be described in, or will become apparent to those of ordinary skill in the art in view of, the following detailed description of the example embodiments of the present invention.
The invention, as well as a preferred mode of use and further objectives and advantages thereof, will best be understood by reference to the following detailed description of illustrative embodiments when read in conjunction with the accompanying drawings, wherein:
Many problems in computing systems, database systems, and the like, can be reduced to a range searching operation where a finite set of points P⊂ℝ^d and a volume (or range) γ⊂ℝ^d are given, and one wishes to find P∩γ quickly. That is, with a database with d features, each record is considered a point (r1, . . . , rd)∈ℝ^d. A range in this d-dimensional space is defined as a volume where a point matches a query if, and only if, the record is in the volume. In such cases, at least for some database queries, these database queries can be stated as a range (volume) query problem. For example, the query (a1≤r1≤b1) ∧ . . . ∧(ad≤rd≤bd) can be stated as an axis-parallel hyperbox range.
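As a minimal plaintext sketch of the hyperbox formulation above, a record matches when every coordinate lies within its bounds. The names `in_box` and `range_query` and the sample records are hypothetical, and no encryption is involved here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def in_box(record: Sequence[float],
           lows: Sequence[float],
           highs: Sequence[float]) -> bool:
    """True if a_i <= r_i <= b_i holds for each of the d coordinates."""
    return all(a <= r <= b for r, a, b in zip(record, lows, highs))


def range_query(points: List[Point],
                lows: Sequence[float],
                highs: Sequence[float]) -> List[Point]:
    """Return P ∩ γ, where γ is the axis-parallel hyperbox [lows, highs]."""
    return [p for p in points if in_box(p, lows, highs)]


# Hypothetical records with d = 2 features: (age, salary).
records = [(30, 52000), (45, 61000), (27, 48000), (61, 75000)]
matches = range_query(records, lows=(25, 45000), highs=(50, 65000))
print(matches)
```

Each record is tested individually here, which is the per-record cost that the tree-based approach described later aims to avoid.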
As a simple example, consider a location computing service provided by one or more computing devices, where P⊂ℝ^2 can be the locations of vehicles, e.g., ice cream trucks, ride share vehicles, etc., and γ can be a small area centered at a person's location. Thus, P∩γ is the set of vehicles within walking distance. In some cases, it is important to keep γ private. For example, if the person is a child, then that child's parent may wish to keep the child's location private, or the person themselves may want to keep their own location private from other entities. In addition, the party operating the vehicle, e.g., the ice cream truck company, may wish to avoid knowing its client's location fearing liability if the company's servers are breached. Thus, it is beneficial to have the data and queries encrypted to protect the privacy of the parties involved. Moreover, since the locations of the vehicles change continuously, their locations cannot be downloaded offline, and downloading the entire location database is inefficient, especially if it is expected that the service is accessed over a large number of mobile computing devices, as is commonplace in modern wireless distributed computing systems. In addition, downloading the entire location database also introduces similar privacy concerns as there are multiple copies of the database in such instances and access to the location data is less restricted once downloaded.
Fully Homomorphic Encryption (FHE) allows analytical functions to be performed on private data without divulging the private data itself, without the need for trusted third parties or intermediaries, and without having to mask or drop features in order to preserve privacy of the data, e.g., through replacement of personally identifiable information (PII) with generic privacy preserving representations of the PII, or other modifications or replacements of private data. When keeping private data in a database in the cloud, the data in the database, and the queries on the database, are encrypted to protect the privacy of the data. This encryption may be accomplished using FHE mechanisms. With such systems, the private data is not exposed outside the database, but queries directed to particular types of evaluations may be applied against the encrypted database to obtain useful information without exposing the private data. For example, a query may want to identify a particular number of entries in the database that match criteria specified in the query. The query is encrypted, as is the private data, but the FHE mechanism is able to generate the query results, e.g., the count of the number of matching entries in the database, and return the count without exposing the encrypted contents of the query or the encrypted private data in the database.
With existing FHE mechanisms, when responding to an encrypted database query, it is necessary that the FHE mechanism go over every record in the database array to determine which records match the criteria of a query as there is no ability to determine which subsets of records may be excluded from the evaluation due to the encrypted nature of the data and the query. That is, every record in the database array is evaluated and, for a counting query as an example, the record is counted if it meets the criteria of the query. This is referred to as the “naïve” approach or implementation as it effectively checks for every point p∈P whether it is contained in γ for a total of O(n·t) operations, where n=|P| and t is the time to check whether p∈γ. It is readily apparent that this naïve approach quickly becomes impractical even for medium size databases.
In plaintext, some solutions avoid checking each point or record in a database by utilizing functions to group points together, checking whether the entire group is contained in the range (or volume) γ, and recursively continuing only in groups that are partially contained in the specified range γ. These solutions rely heavily on branching in the code, which is not possible under FHE and therefore not possible when solving secure range searching problems. When running under FHE, an algorithm is unable to compare values and then compute only one branch of code according to the result of the comparison. Instead, each condition is replaced with a polynomial c(·) whose variables are the inputs to the comparison and whose value is 1 or 0 depending on the compared values. Branching is then replaced by computing both branches, multiplying the result of one branch by the polynomial c and the result of the other branch by (1−c), and summing the two products. This effectively computes both branches of code. This is why, when prior works traverse a tree under FHE, they effectively visit all nodes in the tree. Thus, solutions that work well in plaintext do not extend well to FHE.
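The replacement of branching by arithmetic can be illustrated with a minimal plaintext sketch. The function names are hypothetical, and `less_than` merely stands in for the comparison polynomial c(·); under FHE it would be a polynomial evaluated on ciphertexts rather than a native comparison.

```python
def less_than(x: int, y: int) -> int:
    """Stand-in for the comparison polynomial c(.): 1 if x < y, else 0.

    Assumption: computed in plaintext here for illustration only.
    """
    return int(x < y)


def oblivious_select(c: int, if_true: int, if_false: int) -> int:
    """Combine both branch results arithmetically: c*A + (1 - c)*B."""
    return c * if_true + (1 - c) * if_false


def branch_free_min(x: int, y: int) -> int:
    """Minimum of x and y with no data-dependent branch."""
    c = less_than(x, y)
    # Both "branches" (x and y) are always evaluated; c only weights them.
    return oblivious_select(c, x, y)


print(branch_free_min(3, 7), branch_free_min(9, 4))
```

Because both operands are always computed and only weighted by c, the control flow is identical for every input, which is the property that forces a straightforward tree traversal under FHE to visit every node.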
The illustrative embodiments provide an improved computing tool and improved computing tool functionality/operations that address tree traversal under FHE and more specifically, in some illustrative embodiments, traversing a partition tree to solve the secure range searching problem under FHE. For the secure range searching problem, a solution is provided that uses [expression not reproduced] operations, where t is the number of operations needed to compare a range to a simplex and ϵ>0 is a parameter chosen when implementing the solution. Choosing a small value for ϵ reduces the value of [expression not reproduced], but there is a multiplicative factor that depends on ϵ and increases as ϵ decreases. The value of ϵ can be arbitrarily small, depending on the desired implementation, but should be selected taking the multiplicative factor into consideration as well.
Since, in practice, comparing the range to various objects is what dominates the running time, in many cases the improved computing tool and improved computing tool functionality outperforms the naïve solution, i.e., checking each record, which takes O(n·t) to compute. It is noted that under FHE the problem has a lower bound of Ω(n) since there exists a reduction from the private information retrieval (PIR) problem to the range searching problem. It should also be noted that the best known plaintext solution, with storage of size linear in n, takes [expression not reproduced] time using a data structure referred to as “partition trees” and thus, the secure range searching problem solution of the illustrative embodiments is within O(n) time of the plaintext solution.
With the illustrative embodiments, the improved time bounds achieved by the partition tree using the copy-and-recurse computing tool described herein, may be at least partially attributed to the illustrative embodiments' use of partition trees to solve range searching together with the improved copy-and-recurse functionality that is specific to the present invention. That is, in accordance with some illustrative embodiments, the mechanisms of the illustrative embodiments generate a partition tree for a set of data and then perform a copy-and-recurse based secure range searching operation on the partition tree to find the groups of data points or records of a database that match criteria of a secure query using range searching mechanisms. While the illustrative embodiments will make reference to partition tree data structures for a set of data, where nodes represent a subset of the set of data and edges connect the nodes in a parent-child relationship, the illustrative embodiments are not limited to such and may be applied to other tree data structures where a property of the tree data structure is that at most x children need to be recursed into, where x is a predetermined value. For example, the tree data structure may be a decision tree data structure where each node is associated with a condition that needs to be checked and each leaf node is associated with a label. Other tree data structures may likewise be a basis for the mechanisms of the improved computing tool and improved computing tool functionality of the illustrative embodiments without departing from the spirit and scope of the present invention.
Assuming a partition tree data structure embodiment, the copy-and-recurse computing tool allows the range searching process to traverse partition trees efficiently under FHE. Specifically, when traversing an r-ary partition tree (i.e., each inner node has r children) that has a bound ξ<r on the number of children the process needs to recurse into, the copy-and-recurse computing tool and functionality copies ξ children and their subtrees (under FHE) and recurses only into the copied children. The choice of r and the bound ξ determine the value of ϵ. This copy-and-recurse computing tool, executed on partition trees, solves the range searching problem, such as those used in FHE based mechanisms, e.g., encrypted databases and encrypted queries. However, the mechanisms of the illustrative embodiments are applicable to other tree based solutions as well without departing from the spirit and scope of the present invention as noted above.
In some illustrative embodiments, the copy-and-recurse functionality efficiently traverses a full r-ary tree (i.e., each inner node has r children) with n leaves, where there is a bound ξ on the number of children that need to be recursed into at each node. Here r is a parameter 0<r<n and the traversing complexity depends on r and ξ. As an overview, the copy-and-recurse functionality traverses a partition tree by, when visiting a node, determining (under FHE) which children need to be recursed into, copying ξ children and their subtrees to a buffer, and then continuing recursion into the copies of the ξ children. Range searching queries can be answered using the partition trees which comply with the requirements of the r-ary tree where at most [expression not reproduced] children need to be recursed into, where r is the number of children and d is the dimension of the space in which the problem is defined.
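The overview above can be sketched as a plaintext simulation. The sketch rests on several assumptions that are not taken from the description: the points are one-dimensional, nodes are bounded by intervals rather than simplices (so at most ξ=2 children can cross an interval query), the query is a counting query, and the 0/1 indicators and the equality test on `rank` are ordinary Python comparisons standing in for polynomials evaluated under FHE. All function names are hypothetical.

```python
def build(points, r):
    """Full r-ary tree over sorted points (len(points) must be a power of r).

    A node is the flat list [lo, hi, count, *child_0, ..., *child_{r-1}],
    so every subtree at a given level has the same length.
    """
    if len(points) == 1:
        return [points[0], points[0], 1]
    step = len(points) // r
    node = [points[0], points[-1], len(points)]
    for i in range(r):
        node += build(points[i * step:(i + 1) * step], r)
    return node


def contained(lo, hi, a, b):
    """Indicator (0/1) that [lo, hi] lies inside the query range [a, b]."""
    return int(a <= lo and hi <= b)


def crosses(lo, hi, a, b):
    """Indicator (0/1) that [lo, hi] intersects [a, b] but is not inside it."""
    return int(lo <= b and a <= hi) * (1 - contained(lo, hi, a, b))


def count_in_range(node, r, xi, a, b, stats):
    """Count points in [a, b], recursing into exactly xi copied children."""
    stats["visited"] += 1
    if len(node) == 3:  # leaf: a single point, or an all-zero dummy
        return contained(node[0], node[1], a, b) * node[2]
    size = (len(node) - 3) // r
    children = [node[3 + i * size:3 + (i + 1) * size] for i in range(r)]
    # Children fully inside the range contribute their stored count directly.
    total = sum(contained(c[0], c[1], a, b) * c[2] for c in children)
    # Copy step: the j-th crossing child is accumulated into buffer slot j
    # using only multiplications by 0/1 selectors. Unused slots stay all-zero
    # and contribute nothing to the count.
    buffer = [[0] * size for _ in range(xi)]
    rank = 0
    for c in children:
        bit = crosses(c[0], c[1], a, b)
        rank += bit
        for j in range(xi):
            sel = bit * int(rank == j + 1)
            for k in range(size):
                buffer[j][k] += sel * c[k]
    # Recurse step: always descend into exactly xi copies, whatever the data.
    for copy in buffer:
        total += count_in_range(copy, r, xi, a, b, stats)
    return total


points = list(range(64))
tree = build(points, r=4)
stats = {"visited": 0}
result = count_in_range(tree, r=4, xi=2, a=10, b=41, stats=stats)
print(result, stats["visited"])
```

In this example the tree has 85 nodes, yet the traversal touches 15 of them (1+2+4+8), because each level recurses into ξ=2 copies rather than all r=4 children. The cost of the copy step itself, which reads every child subtree, is not counted in `stats`.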
The range searching based operations, e.g., counting a number of matching records in a database, reporting matching records in a database, and the like, may be generalized as operations that output ƒ(P∩γ) for a large set of functions. Specifically, this set includes any function ƒ for which there exists another function g such that ƒ(A∪B)=g(ƒ(A), ƒ(B)), where A, B⊂P and A∩B=Ø. This means that the mechanisms of the illustrative embodiments can be applied to compute any function ƒ( ) that can be applied in a divide-and-conquer way, i.e., by splitting the set (A∪B) to which it is applied into two sets A and B, computing ƒ(A) and ƒ(B), and then joining the outputs to get ƒ(A∪B). The illustrative embodiments are able to perform such range searching based operations when processing queries while preserving privacy, such as via FHE mechanisms. In some illustrative embodiments, the improved computing tool functionality may be implemented with a homomorphic encryption HElayers library, available from IBM Corporation of Armonk, New York, to write packing-oblivious code and the Homomorphic Encryption for Arithmetic of Approximate Numbers (HEAAN) software library as the FHE scheme, although other HE and FHE libraries and mechanisms may be used without departing from the spirit and scope of the present invention.
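The divide-and-conquer property ƒ(A∪B)=g(ƒ(A), ƒ(B)) can be illustrated with a short plaintext sketch. The helper `evaluate` and the sample sets are hypothetical; the average is handled by letting ƒ return a (sum, count) pair, which is one possible way to fit it to this form.

```python
from functools import reduce


def evaluate(parts, f, g):
    """Apply f to each disjoint part, then join the partial outputs with g."""
    return reduce(g, (f(part) for part in parts))


# Two disjoint subsets A and B of a hypothetical point set P.
A = [3, 8, 5]
B = [10, 2]

# Counting: f = len, g = addition.
count = evaluate([A, B], f=len, g=lambda x, y: x + y)

# Summation: f = sum, g = addition.
total = evaluate([A, B], f=sum, g=lambda x, y: x + y)

# Maximum: f = max, g = max.
largest = evaluate([A, B], f=max, g=max)

# Average: f returns (sum, count); g adds the pairs component-wise.
pair = evaluate(
    [A, B],
    f=lambda s: (sum(s), len(s)),
    g=lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
mean = pair[0] / pair[1]

print(count, total, largest, mean)
```

Each result equals what ƒ would give on A∪B directly, which is what allows partial results from different subtrees to be combined.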
Before continuing the discussion of the various aspects of the illustrative embodiments and the improved computer operations performed by the illustrative embodiments, it should first be appreciated that throughout this description the term “mechanism” will be used to refer to elements of the present invention that perform various operations, functions, and the like. A “mechanism,” as the term is used herein, may be an implementation of the functions or aspects of the illustrative embodiments in the form of an apparatus, a procedure, or a computer program product. In the case of a procedure, the procedure is implemented by one or more devices, apparatus, computers, data processing systems, or the like. In the case of a computer program product, the logic represented by computer code or instructions embodied in or on the computer program product is executed by one or more hardware devices in order to implement the functionality or perform the operations associated with the specific “mechanism.” Thus, the mechanisms described herein may be implemented as specialized hardware, software executing on hardware to thereby configure the hardware to implement the specialized functionality of the present invention which the hardware would not otherwise be able to perform, software instructions stored on a medium such that the instructions are readily executable by hardware to thereby specifically configure the hardware to perform the recited functionality and specific computer operations described herein, a procedure or method for executing the functions, or a combination of any of the above.
The present description and claims may make use of the terms “a”, “at least one of”, and “one or more of” with regard to particular features and elements of the illustrative embodiments. It should be appreciated that these terms and phrases are intended to state that there is at least one of the particular feature or element present in the particular illustrative embodiment, but that more than one can also be present. That is, these terms/phrases are not intended to limit the description or claims to a single feature/element being present or require that a plurality of such features/elements be present. To the contrary, these terms/phrases only require at least a single feature/element with the possibility of a plurality of such features/elements being within the scope of the description and claims.
Moreover, it should be appreciated that the use of the term “engine,” if used herein with regard to describing embodiments and features of the invention, is not intended to be limiting of any particular technological implementation for accomplishing and/or performing the actions, steps, processes, etc., attributable to and/or performed by the engine, but is limited in that the “engine” is implemented in computer technology and its actions, steps, processes, etc. are not performed as mental processes or performed through manual effort, even if the engine may work in conjunction with manual input or may provide output intended for manual or mental consumption. The engine is implemented as one or more of software executing on hardware, dedicated hardware, and/or firmware, or any combination thereof, that is specifically configured to perform the specified functions. The hardware may include, but is not limited to, use of a processor in combination with appropriate software loaded or stored in a machine readable memory and executed by the processor to thereby specifically configure the processor for a specialized purpose that comprises one or more of the functions of one or more embodiments of the present invention. Further, any name associated with a particular engine is, unless otherwise specified, for purposes of convenience of reference and not intended to be limiting to a specific implementation. Additionally, any functionality attributed to an engine may be equally performed by multiple engines, incorporated into and/or combined with the functionality of another engine of the same or different type, or distributed across one or more engines of various configurations.
In addition, it should be appreciated that the following description uses a plurality of various examples for various elements of the illustrative embodiments to further illustrate example implementations of the illustrative embodiments and to aid in the understanding of the mechanisms of the illustrative embodiments. These examples are intended to be non-limiting and are not exhaustive of the various possibilities for implementing the mechanisms of the illustrative embodiments. It will be apparent to those of ordinary skill in the art in view of the present description that there are many other alternative implementations for these various elements that may be utilized in addition to, or in replacement of, the examples provided herein without departing from the spirit and scope of the present invention.
Various aspects of the present disclosure are described by narrative text, flowcharts, block diagrams of computer systems and/or block diagrams of the machine logic included in computer program product (CPP) embodiments. With respect to any flowcharts, depending upon the technology involved, the operations can be performed in a different order than what is shown in a given flowchart. For example, again depending upon the technology involved, two operations shown in successive flowchart blocks may be performed in reverse order, as a single integrated step, concurrently, or in a manner at least partially overlapping in time.
A computer program product embodiment (“CPP embodiment” or “CPP”) is a term used in the present disclosure to describe any set of one, or more, storage media (also called “mediums”) collectively included in a set of one, or more, storage devices that collectively include machine readable code corresponding to instructions and/or data for performing computer operations specified in a given CPP claim. A “storage device” is any tangible device that can retain and store instructions for use by a computer processor. Without limitation, the computer readable storage medium may be an electronic storage medium, a magnetic storage medium, an optical storage medium, an electromagnetic storage medium, a semiconductor storage medium, a mechanical storage medium, or any suitable combination of the foregoing. Some known types of storage devices that include these mediums include: diskette, hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or Flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile disk (DVD), memory stick, floppy disk, mechanically encoded device (such as punch cards or pits/lands formed in a major surface of a disc) or any suitable combination of the foregoing. A computer readable storage medium, as that term is used in the present disclosure, is not to be construed as storage in the form of transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide, light pulses passing through a fiber optic cable, electrical signals communicated through a wire, and/or other transmission media.
As will be understood by those of skill in the art, data is typically moved at some occasional points in time during normal operations of a storage device, such as during access, de-fragmentation or garbage collection, but this does not render the storage device as transitory because the data is not transitory while it is stored.
It should be appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination.
As discussed above, the illustrative embodiments provide an improved computing tool and improved computing tool functionality/operations for performing privacy preserving range searches. That is, the illustrative embodiments provide an improved computing tool and improved computing tool operations/functionality for providing a copy-and-recurse operation on partition tree data structures to perform fully homomorphic encrypted query processing that provides improved performance over computer solutions that require evaluating each individual data point, or record in a database, to process encrypted queries of encrypted data.
As shown in
The cloud service 230 may provide any suitable computing service to the client computing devices 210, 212, which may be wired or wireless client computing devices 210, 212, e.g., desktop computers, mobile computerized communication devices, tablet computers, vehicle mounted computing devices, or the like. For example, the cloud service 230 may provide a location service (see example discussed with regard to
The cloud service 230 is one that utilizes FHE to protect the private data of the backend data store 250 provider as well as the privacy of the queries submitted by the client computing devices 210, 212. The cloud service provider system 230 implements a FHE enabled search engine 240 to process encrypted search queries from client computing devices 210, 212 and provide encrypted results which may then be decrypted at the client computing devices 210, 212 without exposing the backend data stored in the backend data stores 250.
The FHE enabled search engine 240 is improved by the mechanisms of the illustrative embodiments to provide a partition tree based range search capability that improves the way in which FHE queries are processed, as will be described in greater detail hereafter, with regard to specifics of the various algorithms implemented in the depicted components. The FHE enabled search engine 240 includes a partition tree generator 242 which provides logic for generating one or more partition trees corresponding to the data, or portions of the data, stored in the backend data stores 250 (see discussion hereafter with regard to
The partition tree range search engine 244 may invoke operations of the copy-and-recurse engine 246 to perform copy-and-recurse operations on child nodes of the partition tree, as described hereafter, to facilitate range searching under FHE. While not shown in
The FHE enabled search engine 240 operates on an encrypted search query and one or more partition tree data structures representing the backend data, to perform range searches for operations corresponding to the received encrypted search query. The actual FHE operations themselves may be performed by the FHE engine 249, with the other components operating to govern which portions of data in the backend data store 250 need to be evaluated with FHE operations, in a manner that is more efficient than checking every data point or record in the backend data stores 250. That is, the FHE enabled search engine 240 may receive an encrypted query from a client computing device 210 which requests an operation that can be represented as a range search operation in FHE, e.g., a counting operation, a reporting operation, an averaging operation, a k-means clustering operation or the like. The FHE enabled search engine 240 performs range searches on partition trees using a copy-and-recurse functionality as described herein, to identify child nodes and sub-trees that should be recursed into such that FHE operations by the FHE engine 249 need only be performed on selected sub-portions of the data in the backend data stores, rather than all data points or all records in the database. Once the results of the FHE operations on the selected sub-portion are completed, a response to the original encrypted query is generated and the cloud service provider system 230 may return encrypted results to the client computing devices 210, 212.
Before discussing in detail the partition tree and copy-and-recurse operation based mechanisms of the illustrative embodiments, it is first beneficial to discuss fully homomorphic encryption operations, notations, and computational geometry terminology that are used in this description. Fully homomorphic encryption (FHE) is an asymmetric encryption scheme that supports addition and multiplication operations on ciphertexts. More specifically, an FHE scheme is the tuple E=(Gen, Enc, Dec, Add, Mult), where:
Correctness in FHE is the requirement that m=m′, i.e., the decrypted message is the same as the original message m, and that c=a+b mod p and d=a·b mod p, i.e., that one can apply add/multiply operations to two ciphertexts and obtain the sum/product of their messages. In an approximate FHE approach, it is required that m≈m′, c≈a+b mod p and d≈a·b mod p. The notation [[·]]pk is used to denote a ciphertext, but when pk (the public key) is clear from the context it may be omitted. The following abbreviated syntax will be used in this description:
Using these operations, one can construct any arithmetic circuit (a model for computing polynomials) and compute any polynomial in the ciphertexts x1, x2, . . . etc. For example, in a client-server computing system, the client may encrypt their data and send it to the server to compute a polynomial on the encrypted input. The output is also encrypted and is returned to the client, as mentioned above with regard to
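The client-server flow described above can be sketched with a toy stand-in for the tuple (Gen, Enc, Dec, Add, Mult). This is an assumption-laden illustration only: `MockCiphertext` provides no security whatsoever, the function signatures and the modulus `P` are hypothetical choices, and no real FHE library is used.

```python
from dataclasses import dataclass

P = 65537  # plaintext modulus, chosen arbitrarily for this sketch


@dataclass(frozen=True)
class MockCiphertext:
    """NOT encryption: a wrapper marking values the server must not inspect."""
    _m: int


def gen():
    """Return placeholder (secret key, public key) values."""
    return "sk", "pk"


def enc(pk, m):
    return MockCiphertext(m % P)


def dec(sk, c):
    return c._m


def add(pk, x, y):
    """Ciphertext holding the sum of the two messages, mod P."""
    return MockCiphertext((x._m + y._m) % P)


def mult(pk, x, y):
    """Ciphertext holding the product of the two messages, mod P."""
    return MockCiphertext((x._m * y._m) % P)


def server_polynomial(pk, x, y):
    """Server side: evaluate x*y + x using only add and mult."""
    return add(pk, mult(pk, x, y), x)


# Client side: encrypt, send, receive, decrypt.
sk, pk = gen()
cx, cy = enc(pk, 6), enc(pk, 7)
result = dec(sk, server_polynomial(pk, cx, cy))
print(result)
```

The server function never calls `dec` and never branches on a message value, which mirrors the constraint that a real FHE evaluation is limited to additions and multiplications on ciphertexts.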
With regard to computational geometry terminology and concepts used herein, it is important to understand the concepts of range space, range searching, algebraic range, semi-algebraic range, constant description complexity, elementary cell partition (or simplicial partition), crossing number, and the partition theorem. These are described hereafter as follows.
Range space: A range space is a pair (X,Γ), where X is a set (of anything, e.g., points in ℝ^d for some d, where again d is the dimensional space of the data points, e.g., database records in the backend data stores 250 in
Range Searching: The range searching problem studied in computational geometry is as follows: given a set of n points P⊂ℝ^d and a family of ranges Γ, preprocess P into a data structure such that, given a range γ∈Γ and using that data structure, |P∩γ| can be efficiently computed. In the example of
Algebraic range: A d-dimensional algebraic range is a subset γ⊂ℝ^d defined by an algebraic surface given by a function that divides ℝ^d into two regions (e.g., above and below). This function is also denoted as γ.
Semi-algebraic range: A d-dimensional semi-algebraic range is a subset γ⊂ℝ^d that is a conjunction and disjunction of a bounded number of algebraic ranges. Simply put, a semi-algebraic range is the result of intersections and unions of algebraic ranges.
Constant description complexity: The description complexity of a range is the number of parameters needed to describe it. For example, the half-space range bounded by the plane ax+by+cz+1=0 in ℝ^3 has 3 parameters, a, b, and c. The description complexity can be large. For example, a star-shaped volume in ℝ^3 with n “spikes” has a description complexity of O(n). In some illustrative embodiments, the mechanisms of the illustrative embodiments operate based on ranges that have constant description complexity.
Elementary Cell Partition (or Simplicial Partition): Given a set P⊂ℝ^d of n points, an elementary cell partition (or simplicial partition) is a collection Π={(P1,σ1), . . . , (Pm,σm)} where the Pi's are disjoint subsets such that ∪Pi=P and each Pi⊂σi, where σi is a simplex. The size of the partition is m.
Crossing number: Given a simplicial partition Π={(P1,σ1), . . . , (Pm,σm)} and a range γ, the crossing number of γ with respect to Π is the number of simplices γ crosses, i.e., |{σi | σi∩γ≠σi and σi∩γ≠Ø, for i=1,2, . . . ,m}|.
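As a small worked illustration of the crossing number in one dimension, the following sketch represents each simplex σi as a closed interval and a range γ as an open segment (a, b), and counts the intervals that γ intersects without containing. The function names and the tuple representation are illustrative assumptions and are not taken from the described embodiments.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # closed 1-D simplex [lo, hi] (assumed representation)


def crosses(sigma: Interval, a: float, b: float) -> bool:
    """True if the open segment (a, b) intersects sigma but does not contain it."""
    lo, hi = sigma
    intersects = lo < b and a < hi
    contained = a < lo and hi < b
    return intersects and not contained


def crossing_number(partition: List[Interval], a: float, b: float) -> int:
    """Number of cells of the partition that the segment (a, b) crosses."""
    return sum(crosses(sigma, a, b) for sigma in partition)


# Nine points 1..9 split into three cells of three points each.
partition = [(1, 3), (4, 6), (7, 9)]
print(crossing_number(partition, 1.5, 7.5))  # prints 2: the first and third cells are crossed
```

In this example the middle cell is fully contained in the segment, so only the two outer cells would need a finer inspection.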
Partition Theorem (Theorem 1): Given a set P of n points in ℝ^d, for some fixed dimension d, a family of ranges Γ, and a parameter r≤n, an elementary cell partition Π={(P1,σ1), . . . , (Pm,σm)} can be computed in randomized expected time O(nr+r^3) such that:
Having set forth some terminology and concepts of range searching and computational geometry above, a problem addressed by the mechanisms of the illustrative embodiments, e.g., FHE enabled search engine 240 in d, a family of ranges Γ and a function ƒ, a data structure is to be constructed that efficiently computes ƒ(P∩γ) for any γ∈Γ. The function ƒ is a function ƒ: 2^P→D that maps a subset of P to a value d∈D from a domain D. The solution to this problem also requires a function g: D×D→D such that ƒ(A∪B)=g(ƒ(A),ƒ(B)) for A,B⊆P with A∩B=Ø. In other words, ƒ may be computed on P∩γ by dividing P∩γ into two disjoint subsets, computing ƒ on each, and then joining the two sub-results with g. The definitions of ƒ, g and D vary with the underlying problem. Several use cases for ƒ and g are:
With this problem statement in mind, the illustrative embodiments consider a security model with four primary parties: (1) the key owner; (2) the owner of the data set P⊂ℝ^d, who also builds the partition tree; (3) the owner of a range γ∈Γ; and (4) the cloud. A single entity may operate as multiple ones of these parties, or there may be separate entities for each party. However, it is required that the key owner and the cloud do not collude.
For example, Alice may be a key owner and a range owner, an ice cream truck company, or ride share company (using the previously mentioned examples) may be the data owner, and a cloud service provider, e.g., International Business Machines (IBM) Corporation, may be the cloud party. In another example, a hospital may be the key owner and data owner (of medical data), an analytics company may be the range owner (e.g., being able to predict people who are at risk of a cardiovascular arrest), and a web services platform provider, e.g., again IBM, may be the cloud party. Further example scenarios include:
In the example of
In some illustrative embodiments, the data owner encrypts their partition tree and uploads the encrypted partition tree to the cloud. The range owner encrypts their range and uploads it to the cloud. The cloud, e.g., cloud service provider system 230, then computes ƒ(P∩γ) under FHE and sends the results of the computation to the key owner, e.g., client 210, 212, to be decrypted. The semantic security of FHE guarantees that the cloud learns nothing about the content of γ and P. This range searching process is outlined in
As noted above, the illustrative embodiments operate on tree data structures, such as partition trees, which may be generated for backend data stores 250 by the partition tree generator 242, for example. In the description of partition trees herein, v is used to denote a node in the partition tree and dot (“.”) denotes members of v. Thus, for example, v.child[1], . . . ,v.child[m] are the children 1 to m of v. The root of the partition tree is referred to as root and the height of a node v is the maximal number of nodes on the path from the node v to the root.
Each node v in a partition tree is associated with a subset Sv⊆P. This subset is not kept as a field of v. Each node in the partition tree keeps these fields as attribute data for the node:
A partition tree additionally has these properties:
The children of v are derived from a simplicial partition of Sv, that follows from the partition theorem (Theorem 1). For a simplicial partition Π={(P1,σ1), . . . , (Pm,σm)} of Sv set v to have m children with:
For any inner node v, every range γ∈Γ intersects at most ξ of its children's simplices v.child[1].σ, . . . , v.child[m].σ. The value of ξ depends on Γ and on how Sv is partitioned.
The partition tree may be built using different constructions for the particular use cases and implementations. As examples, partition tree constructions for three different use cases may include: (1) ranges that are 1 dimensional, i.e., γ={x∈ℝ | a<x<b}; (2) general semi-algebraic ranges in ℝ^d; and (3) ranges that are hyperboxes parallel to the axes, γ={(x1, . . . ,xd)∈ℝ^d | a1<x1<b1 ∧ . . . ∧ ad<xd<bd}.
In the first case, when d=1, P=(p1, . . . ,pn) where p1< . . . <pn∈R, Γ is the family of all segments and ƒ, g functions are as mentioned above. Also let 0<r<n be a parameter. To build the partition tree, the process starts by setting v=root and Sv=Sroot=P. Then, the process performs the following operations:
An important property of a partition tree is that every range γ∈Γ intersects at most ξ simplices from v.child[1].σ, . . . ,v.child[m].σ. If the range intersects a simplex (i.e., is not contained and does not avoid the simplex), then it is not known if the points inside the simplex are all inside the range or all outside the range, and thus, the process needs to recurse into the subtree for that simplex. For the 1-dimensional case, ξ=2, as follows from the following lemma:
When building partition trees for dimensions greater than 1, let P⊂ℝ^d be a subset with n points where d>1, Γ be a range family and ƒ a function as mentioned above. Also let r>0 be a parameter. The process starts by setting v=root and Sv=Sroot=P. Then, the process involves the following operations:
Given a partition tree, such as may be generated by the partition tree generator 242 in and Γ is the family of all segments (in one dimension, a range with constant description complexity is referred to as a segment). In the first portion 410 there are shown 9 points p1< . . . <p9∈ℝ, with the dashed line 412 illustrating the boundaries of a segment γ={x|a<x<b}, where p1<a<p2 and p7<b<p8.
In the second portion 420, there is illustrated the partition tree 420 built for P={p1, . . . , p9}. Next to each node v is shown v.ƒ=ƒ(Sv) and v.σ=[pi,pj] (to improve readability this is omitted for some leaves), where v.ƒ is the function applied on the subset of points that the node represents and v.σ is the simplex that contains the points the node represents (as computed by the partition theorem). When comparing (under FHE) γ to the 3 children of root, it is determined that root.child[1].σ and root.child[3].σ cross γ (by construction it is guaranteed that at most 2 simplices cross γ) and root.child[2].σ is contained in γ. Therefore, ƒ(p4,p5,p6) can be added to the output by taking o′=g(ƒ(p4,p5,p6),o). Using the copy-and-recurse operation of the illustrative embodiments, the left and right children of the root (under FHE) are copied into a buffer and the operation recurses into them.
In the third portion 430 of
It should be appreciated that these additions or copying operations may be accomplished by the copy and recurse operation based on a copy-and-recurse matrix built by the copy-and-recurse matrix generator 248 using a procedure referred to as BuildCopyAndRecurseMatrix (also referred to as Algorithm 4), as discussed hereafter with reference to
Keeping a secondary structure at each node may be used to answer range queries when the ranges are conjunctions of algebraic ranges. Under FHE, a secondary data structure, v.D, may be kept in each node v. Since the size of the partition tree is near linear, keeping this secondary structure does not change the size complexity of the primary data structure. Using a secondary data structure, range search queries can be answered more efficiently when Γ is a family of a conjunction of ranges. For example, consider hypercubes, i.e., γ={x∈ℝ^d | ai≤xi≤bi for i=1, . . . ,d}. To construct the hypercube, a primary data structure for ranges γ1={x∈ℝ^d | a1<x1<b1} is generated with a secondary data structure for ranges γ2={x∈ℝ^d | a2<x2<b2}, etc. Here, instead of having a one-structure tree of points in ℝ^d, there is a d-structure partition tree where each structure is of points in ℝ. This leads to improved circuit size, where the circuit size is O(n+t·n^ϵ) instead of
This case is especially interesting because this improves the circuit size of many database queries.
The partition trees may be transformed into partition trees that are “FHE-friendly” partition trees that can be more efficiently used with FHE. An example algorithm for transforming such partition trees into FHE friendly partition trees is shown in
In a partition tree T that is built for a set P, inner nodes may have different numbers of children, and leaves may be at different distances, or heights, from the root. This happens since Theorem 1 generates a simplicial partition of size r/h<m<r at each node. Since the number of children m depends on P, the structure of T may leak information on P. It is therefore important to hide the structure of the partition tree T. This raises 2 issues: (1) Security: The structure of an unbalanced, non-full tree may leak information on the input; and (2) Input obliviousness: To be input oblivious, the parameters of the nodes visited when traversing the tree (e.g., number of children) must not leak.
To address these issues, the partition tree may be converted to a full tree, where a full tree is one where each node has a maximal number of children and each leaf node is at the maximal height or distance from the root node. The conversion or transformation may be accomplished by adding empty nodes to the initial partition tree in accordance with a methodology of the illustrative embodiments, such as the example FillTree algorithm, also referred to as Algorithm 2, shown in
To transform a partition tree to an FHE-friendly partition tree, empty nodes are repeatedly added until there is a full partition tree. To see the maximal span of inner nodes and maximal height of leaf nodes, recall that Theorem 1 partitions Sv into a simplicial partition Π={(P1,σ1), . . . , (Pm,σm)} with |Sv|/r≤|Pi|<h·|Sv|/r, where h is a constant that depends on Γ. In this case, r/h≤m≤r, where the extremes are when |Pi|=h·|Sv|/r for all i and when |Pi|=|Sv|/r for all i. The height of a leaf v is therefore ⌈log_r n⌉≤height(v)≤⌈log_{r/h} n⌉, where the extremes are when |Pi|=|Sv|/r for all i and all nodes v, and when |Pi|=h·|Sv|/r for all i and all nodes. The maximal span is therefore r and the maximal height is ⌈log_{r/h} n⌉. Thus, it follows that to hide the structure of a partition tree T, empty nodes need to be added until (1) all inner nodes have r children and (2) the distance from the root to each leaf, i.e., the height, is ⌈log_{r/h} n⌉, where h is a constant that comes from the partition theorem (Theorem 1). An example algorithm, i.e., the FillTree algorithm (Algorithm 2), is shown in
Lemma 2 (Height and Spanning Number). Let P be a set of n points in ℝ^d, Γ a family of ranges, and r<n and h parameters such that any simplicial partition Π={(P′1,σ1), . . . , (P′m,σm)} of a subset P′⊆P with respect to Γ satisfies |P′|/r≤|P′i|<h·|P′|/r, and let T=FillTree(n,r,h,T′), where T′ is a partition tree built for P and Γ. Then the height of T is ⌈log_{r/h} n⌉ and it has a total of
nodes.
Proof. From the partition theorem (Theorem 1), at each node v there is a partition with |Pi|≤h·nv/r. It follows that the height of the tree is at most [logr/hn]. The number of children at each node is at most r. The number of nodes is therefore
Lemma 3. Let P1,P2⊂ℝ^d be 2 sets of points with |P1|=|P2|=n, and let T′1, T′2 be 2 partition trees built for P1 and P2 with the same parameters r, h. Then T1 and T2 have the same structure, where Ti=FillTree(n,r,h,T′i).
Proof. The number of children in each node of T′1 and T′2 is at most r for both trees and does not depend on P. In addition, the height of T′1 and T′2 is at most ⌈log_{r/h} n⌉. Since the FillTree algorithm adds nodes to obtain a full tree of height ⌈log_{r/h} n⌉ in which each inner node has exactly r children, T1 and T2 have the same structure.
Lemma 3 guarantees that the structure of T=FillTree(n,r,h,T′) does not leak information on P. However, the content of fields of nodes in T may still leak information on P. Specifically, v.ƒ and v.σ leak information on P. In a privacy preserving application, v.ƒ and v.σ are encrypted for every node in T. The notation [[v.ƒ]] and [[v.σ]] are used to denote these values are encrypted. Applying the FillTree algorithm on a partition tree and then encrypting v.ƒ and v.σ guarantees that T does not leak data on P.
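The padding idea behind FillTree and Lemma 3 can be sketched in plaintext as follows. This is a minimal sketch, not the FillTree algorithm itself: the `Node` class, the `fill_tree` and `shape` helpers, and the choice to hang empty nodes below shallow leaves are all assumptions made for illustration, and the input tree is assumed to have at most r children per node and height at most `height`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    f: int = 0  # aggregated value; 0 for padding (empty) nodes
    children: List["Node"] = field(default_factory=list)


def fill_tree(v: Node, r: int, height: int) -> Node:
    """Pad v in place so every inner node has r children and every leaf is at depth `height`."""
    if height == 0:
        return v
    while len(v.children) < r:
        v.children.append(Node())  # empty node: contributes nothing to any query
    for child in v.children:
        fill_tree(child, r, height - 1)
    return v


def shape(v: Node) -> list:
    """Nested-list view of the structure only, ignoring node contents."""
    return [shape(c) for c in v.children]


# Two differently shaped inputs become structurally identical after padding.
t1 = fill_tree(Node(3, [Node(1), Node(2, [Node(1), Node(1)])]), r=3, height=2)
t2 = fill_tree(Node(3, [Node(1), Node(1), Node(1)]), r=3, height=2)
print(shape(t1) == shape(t2))  # prints True
```

After padding, only the (encrypted) field contents distinguish the two trees, which is the property Lemma 3 relies on.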
The first function, IsContaining_{d,Γ}(σ,γ): This function gets as input an encrypted simplex and range, σ,γ⊂ℝ^d, where d is a parameter and the range is taken from a family of ranges, γ∈Γ. The value of IsContaining is a ciphertext c, where c=1 if σ⊆γ and c=0 otherwise. The implementation details of IsContaining_{d,Γ} depend on d and Γ. Consider, for example, the case d=2 and Γ being the set of all axis-parallel rectangles. In this case, σ is a triangle given by its 3 corners (σ.ax,σ.ay,σ.bx,σ.by,σ.cx,σ.cy), and γ={p∈ℝ^2 | γ.ax<px<γ.bx and γ.ay<py<γ.by} is an axis-parallel rectangle given by its endpoints (γ.ax,γ.ay) and (γ.bx,γ.by). Then, a range γ contains a simplex iff (if and only if) it contains all 3 corners (this follows from the convexity of γ and σ). Checking this condition can be implemented as:
The second function, IsCrossing(σ,γ): This function gets as input an encrypted simplex and range, σ,γ⊂ℝ^d, as above. The value of IsCrossing is a ciphertext c, where c=1 if γ crosses σ (i.e., intersects but does not contain it) and c=0 otherwise. The implementation details of IsCrossing depend on d and Γ. Continuing the example above, IsCrossing can be implemented as:
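One way to express the three-corner containment check, sketched here in plaintext, is as a product of 0/1 comparison bits, so that only comparisons and multiplications are needed. The helper `is_smaller` stands in for an encrypted comparison; it and the tuple layout are assumptions made for this sketch rather than the formula of the embodiments.

```python
from typing import Tuple

Point = Tuple[float, float]


def is_smaller(x: float, y: float) -> int:
    """Plaintext stand-in for an encrypted comparison; returns 1 if x < y, else 0."""
    return 1 if x < y else 0


def point_in_rect(p: Point, lo: Point, hi: Point) -> int:
    """Product of four indicator bits: lo.x < p.x < hi.x and lo.y < p.y < hi.y."""
    return (is_smaller(lo[0], p[0]) * is_smaller(p[0], hi[0])
            * is_smaller(lo[1], p[1]) * is_smaller(p[1], hi[1]))


def is_containing(triangle: Tuple[Point, Point, Point], lo: Point, hi: Point) -> int:
    """1 if all three corners of the triangle lie inside the open rectangle, else 0."""
    a, b, c = triangle
    return point_in_rect(a, lo, hi) * point_in_rect(b, lo, hi) * point_in_rect(c, lo, hi)


tri = ((1.0, 1.0), (2.0, 1.0), (1.0, 2.0))
print(is_containing(tri, (0.0, 0.0), (3.0, 3.0)))  # triangle inside the rectangle
print(is_containing(tri, (0.0, 0.0), (1.5, 3.0)))  # corner (2, 1) lies outside
```

The check uses 12 comparisons and 11 multiplications regardless of the input values, which keeps the evaluation input oblivious.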
It should be noted that, when computing these functions under approximated schemes (such as CKKS), special care should be given to precision. Specifically, since CKKS is an approximated scheme, the value of these functions is 1+ϵ rather than exactly 1, where |ϵ|<e and e depends on the complexity of γ and the parameters of the key. The value e can be made arbitrarily small using noise cleaning techniques, at the cost of additional running time.
In addition, since a polynomial approximation function may be used, functions such as IsSmaller have a correct value only within a specific range. For example, IsSmaller equals an arbitrary value IsSmaller(a,b)∈[−ϵ, 1+ϵ] when |a−b|<δ, for some constant δ which can be made arbitrarily small by taking a higher degree polynomial for IsSmaller. Another way to compute these functions is to (1) use scheme switching to exact schemes (such as BGV, BFV or TFHE); or (2) perform the comparison in exact schemes and then switch back to CKKS.
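The behavior of a polynomial comparison near a tie can be illustrated with a plaintext sketch. The iteration f(x)=(3x−x^3)/2 used below is one commonly used way to approximate the sign function with additions and multiplications only; it is swapped in here purely for illustration, since the description does not state which polynomial IsSmaller uses, and inputs are assumed to lie in [0, 1].

```python
def approx_sign(x: float, iterations: int = 10) -> float:
    """Approximate sign(x) for x in [-1, 1] by iterating f(x) = (3x - x^3) / 2."""
    for _ in range(iterations):
        x = (3.0 * x - x ** 3) / 2.0
    return x


def is_smaller(a: float, b: float, iterations: int = 10) -> float:
    """Approximate indicator of a < b for a, b in [0, 1], using only + and *."""
    return (1.0 + approx_sign(b - a, iterations)) * 0.5


print(is_smaller(0.2, 0.7))     # close to 1: the inputs are well separated
print(is_smaller(0.7, 0.2))     # close to 0
print(is_smaller(0.5, 0.5001))  # near 0.5: |a - b| falls inside the unreliable band
```

Increasing `iterations` raises the degree of the composed polynomial and shrinks the band of unreliable inputs, at the cost of a deeper circuit.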
With regard to complexity, again it should be noted that the description complexity of σ is O(1). Once Γ is fixed, the description complexity of γ is also constant, however, computing these functions depends on the complexity of the ranges. For example, tetrahedron ranges (γ={x∈R3|x*a1<b1 and x*a2<b2 and x*a3<b3}, where a1,a2,a3∈R3 and b1,b2,b3∈R are the parameters defining γ and x*ai indicates the inner product) take less time to compare to than dodecahedron ranges. In addition, from the practical perspective, comparing a simplex (or a point) to γ is what consumes most of the time and it makes sense to count the time they cost separately. Therefore, the size and depth of the arithmetic circuit that computes IsContaining and IsCrossing are denoted by t and , respectively.
Referring again to d), Γ is the range family the partition tree was built for, ξ is a bound on the crossing number of a range γ∈Γ, and h is a parameter determined by the partition theorem (Theorem 1). From this, it follows that the PPRangeSearch algorithm needs to recurse into at most ξ children at each inner node.
The inputs for the PPRangeSearch algorithm include T and γ, where T is a partition tree (or a subtree) and γ is a range. A plaintext notation for T is used because the structure of the partition tree (e.g., depth and number of children) is not encrypted, however it should be noted that the fields T.ƒ and T.σ, which depend on the private data, are encrypted. The output of the algorithm is x where x=ƒ(P∩γ).
The PPRangeSearch algorithm operates recursively. When the PPRangeSearch algorithm is called, it is called to operate on the partition tree T. The PPRangeSearch algorithm then calls itself recursively with a subset of subtrees under the root of T as input. The recursion stops when the PPRangeSearch algorithm reaches a leaf node of the partition tree T. While traversing the partition tree T the PPRangeSearch algorithm collects v.ƒ from various nodes and uses the function g to aggregate these values.
The improved efficiency of the PPRangeSearch algorithm comes, at least in part, from guaranteeing that at most ξ children need to be recursed into. This is done under FHE using copy-and-recurse, i.e., by making a copy of ξ children and their subtrees (among them are the children that need to be recursed into) and recursing into these ξ copies. As noted previously, the copy-and-recurse operation may utilize a buffer to achieve the copy-and-recurse operation.
The PPRangeSearch algorithm, in accordance with one illustrative embodiment, i.e., Algorithm 3 of
In the case where v is an inner node (Lines 5-17) the PPRangeSearch algorithm checks each child to determine whether its bounding simplex, σ, is contained in γ (Line 7). The simplex is a generalization of a polyhedron to arbitrary dimensions, e.g., a 0-dimensional simplex is a point, a 1-dimensional simplex is a line segment, a 2-dimensional simplex is a triangle, a 3-dimensional simplex is a tetrahedron, and a 4-dimensional simplex is a 5-cell.
The algorithm uses the function g to aggregate [[v.child[i].ƒ]] for the children that are contained in γ (Line 8). Then, the algorithm checks which children's simplices cross γ (Line 10). These values are kept in an r-dimensional binary vector Cont. Since Sv.child[i]⊂v.child[i].σ⊂γ for a contained child, the algorithm can then aggregate ƒ(Sv.child[i]) without checking the points of Sv.child[i]. Then, using the BuildCopyAndRecurseMatrix algorithm (see Algorithm 4 described hereafter with regard to
The PPRangeSearch algorithm then recurses into the subtrees whose simplices cross γ to check a finer partition (i.e., into smaller sets) of their points (Lines 14-16). From the properties of the partition tree, the number of these children is at most ξ. In Line 16 the copy of ξ children is processed to aggregate ƒ(Sv.child[i]∩γ) for the children whose simplices cross γ. The value ƒ(Sv.child[i]∩γ) is computed under FHE by recursing into the subtree of v.child[i]. The encrypted output [[x]] is then output by the PPRangeSearch algorithm.
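The control flow described above can be sketched in plaintext for the one-dimensional counting case. The sketch below is an illustration under stated assumptions, not Algorithm 3: all names are hypothetical, comparisons return plain 0/1 integers instead of ciphertexts, the selection rows are computed with a plaintext shortcut rather than with the arithmetic-only BuildCopyAndRecurseMatrix procedure, padding nodes carry ƒ=0 and σ=[0,0], points are distinct and sorted, and the tree is tall enough that every leaf holds at most one point.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    f: int      # f(S_v): number of points under v (counting use case)
    lo: float   # v.sigma is the closed interval [lo, hi]
    hi: float
    children: List["Node"] = field(default_factory=list)


def lt(x: float, y: float) -> int:
    """Plaintext stand-in for an encrypted comparison: 1 if x < y, else 0."""
    return 1 if x < y else 0


def is_containing(v: Node, a: float, b: float) -> int:
    return lt(a, v.lo) * lt(v.hi, b)


def is_crossing(v: Node, a: float, b: float) -> int:
    return lt(v.lo, b) * lt(a, v.hi) * (1 - is_containing(v, a, b))


def build_full_tree(points: List[float], r: int, height: int) -> Node:
    """Full r-ary tree over sorted points; unused slots become zero-valued padding."""
    lo, hi = (points[0], points[-1]) if points else (0.0, 0.0)
    v = Node(len(points), lo, hi)
    if height > 0:
        size = -(-len(points) // r)  # ceil(len(points) / r)
        chunks = [points[i * size:(i + 1) * size] for i in range(r)]
        v.children = [build_full_tree(ch, r, height - 1) for ch in chunks]
    return v


def selection_rows(cross: List[int], xi: int) -> List[List[int]]:
    """Plaintext shortcut for the copy-and-recurse matrix: row j marks the (j+1)-th 1."""
    rows = [[0] * len(cross) for _ in range(xi)]
    seen = 0
    for i, bit in enumerate(cross):
        if bit and seen < xi:
            rows[seen][i] = 1
        seen += bit
    return rows


def copy_child(children: List[Node], row: List[int]) -> Node:
    """Field-by-field sum_i row[i] * children[i]; an all-zero row yields an empty copy."""
    k = len(children[0].children)
    return Node(
        f=sum(w * c.f for w, c in zip(row, children)),
        lo=sum(w * c.lo for w, c in zip(row, children)),
        hi=sum(w * c.hi for w, c in zip(row, children)),
        children=[copy_child([c.children[j] for c in children], row) for j in range(k)],
    )


def pp_range_search(v: Node, a: float, b: float, xi: int = 2) -> int:
    if not v.children:  # leaf
        return is_containing(v, a, b) * v.f
    x = sum(is_containing(c, a, b) * c.f for c in v.children)
    cross = [is_crossing(c, a, b) for c in v.children]
    for row in selection_rows(cross, xi):  # always exactly xi recursions
        x += pp_range_search(copy_child(v.children, row), a, b, xi)
    return x


points = [1, 2, 3, 4, 5, 6, 7, 8, 9]
root = build_full_tree(points, r=3, height=2)
print(pp_range_search(root, 1.5, 7.5))  # prints 6: the points 2..7 lie in the segment
```

Because every inner node recurses into exactly ξ copies, the sequence of operations does not depend on which children actually cross the range; copies selected by an all-zero row contribute 0 to the count.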
As noted above, one important aspect of the illustrative embodiments is the copy-and-recurse operation, performed by the copy-and-recurse engine 246 in
As shown in r:
To understand how Algorithm 4 works it is noted that M[i, j]=1 iff c[i] is the j-th cell with a value of 1. Algorithm 4 starts by setting (Line 2):
One can see that M[1, i]=1 iff c[i] is the first non-zero element in c, i.e., c[i]=1 and c[k]=0 for 1≤k<i. Then, Algorithm 4 continues by setting (Line 5):
which is now explained. M[j−1, k]=1 iff c[k] is the (j−1)-st element with a value of 1. Π_{h=k+1}^{i−1}(1−c[h])=1 iff c[k+1]= . . . =c[i−1]=0. Putting these together and summing over all values of k<i, one gets that Σ_{k=1}^{i−1}(M[j−1, k]·Π_{h=k+1}^{i−1}(1−c[h]))=1 iff there are exactly j−1 values of 1 in c[1], . . . , c[i−1]. Multiplying this by c[i], one obtains that M[j,i]=1 iff c[i] is the j-th value of 1.
The analysis of the size and depth of Algorithm 4, i.e., the BuildCopyAndRecurseMatrix, is summarized in the following lemma.
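The two update rules can be exercised on plaintext 0/1 values with the following sketch, which follows the formulas M[1, i]=c[i]·Π_{h<i}(1−c[h]) and M[j, i]=c[i]·Σ_{k<i}(M[j−1, k]·Π_{h=k+1}^{i−1}(1−c[h])) as they appear in the proof of Lemma 4. The function name and the 0-based indexing are assumptions of this sketch; under FHE each entry of c would be a ciphertext.

```python
from typing import List


def build_copy_and_recurse_matrix(c: List[int], xi: int) -> List[List[int]]:
    """Return M with M[j][i] == 1 iff c[i] is the (j+1)-th 1 in c (0-based indices).

    Only additions and multiplications are used, so the same arithmetic could in
    principle be evaluated on ciphertexts.
    """
    r = len(c)
    M = [[0] * r for _ in range(xi)]
    # First row: c[i] is the first 1 iff every earlier cell is 0.
    for i in range(r):
        prod = 1
        for h in range(i):
            prod *= 1 - c[h]
        M[0][i] = c[i] * prod
    # Later rows: some earlier cell k was the previous 1, with only zeros in between.
    for j in range(1, xi):
        for i in range(r):
            total = 0
            for k in range(i):
                prod = 1
                for h in range(k + 1, i):
                    prod *= 1 - c[h]
                total += M[j - 1][k] * prod
            M[j][i] = c[i] * total
    return M


M = build_copy_and_recurse_matrix([0, 1, 1, 0, 1], xi=2)
print(M)  # [[0, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
```

Multiplying M by the vector of children then yields ξ rows, each holding a copy of one selected child, or zeros when fewer than ξ cells of c are set.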
Lemma 4. Computing M[col, row] for 1≤col≤ξ and 1≤row≤r can be done with a circuit of depth O(ξ·log r) and size O(ξ·r2).
Proof. The lemma is proved by induction on ξ. For ξ=1 one has M[1, row]:=c[row]·Π_{i=1}^{row−1}(1−c[i]), which can be done in a circuit of depth O(log row) and size O(row). Computing for all rows in parallel, one gets a circuit of depth O(log r) and size O(r^2). Assuming the claim holds for all ξ′<ξ, it is proved that it holds for ξ. Since one has M[ξ, row]=c[row]·Σ_{k=1}^{row−1}(M[ξ−1, k]·Π_{h=k+1}^{row−1}(1−c[h])), this can be done with a circuit whose depth is O(log r+(ξ−1) log r) and size is O(r^2+(ξ−1)r^2), which proves the claim.
With regard to the partition tree range search engine 244 operation, e.g., performance of the PPRangeSearch algorithm (Algorithm 3 in
Lemma 5. Let P,T be as in Lemma 2, where |P|=n and r<n is a parameter, then T needs space of O(n^{1+ϵ}), where the value of ϵ depends on r and can be made arbitrarily small.
Proof. From Lemma 2 the number of nodes is
Since O(1) data are kept with each node, the total space is O(n^{1+ϵ}), where ϵ=(log_r h)/(1−log_r h) can be made arbitrarily small by choosing a large r.
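The stated exponent can be checked directly. Assuming, as in Lemma 2, that the padded tree is a full r-ary tree of height H=⌈log_{r/h} n⌉, the node count is a geometric sum dominated by its last level, and rounding H up costs at most a factor of r, which is constant for fixed r:

```latex
\sum_{i=0}^{H} r^{i} = O\!\left(r^{H}\right), \qquad
r^{\log_{r/h} n} = n^{\log_{r/h} r} = n^{\frac{1}{1-\log_r h}} = n^{1+\epsilon},
\qquad \epsilon = \frac{\log_r h}{1-\log_r h}.
```

For fixed h, log_r h tends to 0 as r grows, so ϵ tends to 0 as well.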
Turning now to analyzing the size and depth of the circuit that computes a range search query in accordance with one or more of the illustrative embodiments, consider Theorem 2 as follows:
Informally, the correctness of Theorem 2 follows from the plaintext algorithm that Algorithm 3 implements. The bound on the circuit size is proved by solving the recursion formula of the circuit size. The circuit depth is proved by induction on the tree height. It should be appreciated that Algorithm 3 deviates from a plaintext algorithm in at least 3 primary ways: (1) it adds empty nodes; (2) it always recurses into ξ children (for inner nodes) and (3) it uses the Cross and Cont indicator arrays to conditionally aggregate values into the output.
At each inner node, v, Algorithm 3: (1) computes IsContaining and IsCrossing r times; (2) builds a copy-and-recurse matrix M; (3) copies ξ children of v; and (4) recurses into ξ children of v. Computing all of the IsContaining and IsCrossing computations takes O(t·r) time. From Lemma 4, computing the copy-and-recurse matrix takes O(r^2·ξ). The size of each child (including its subtree) is
and copying ξ children (out of r) takes
It follows that the time to compute a range query is given by the following recursion rule:
This solves to:
For the case d=1, from Lemma 1, ξ=2 and h=1. Putting these into the above equation, one obtains:
For the case d>2 and Theorem 1, ξ=O(r1−1/d) and h=O(1). Putting these into Lemma 1 one obtains:
Putting these together, the circuit size is
The circuit depth may be proven by induction on the height of T. For a tree T of height 1 the root has r leaf children. The circuit starts with r instances of IsContaining and r instances of IsCrossing in parallel, whose depth is . Then the circuit has a subcircuit for the algorithm BuildCopyAndRecurseMatrix whose depth is O(ξ log r). Then, the circuit has an instance of matrix multiplication whose depth is constant. The total depth
+O(ξ·log r).
Assuming the circuit depth of a tree of height (d−1) is (d−1)+(d−1)O(ξ·log r) we prove for a tree of height d. For a tree of height d>1 the circuit has r instances of IsContaining and r instances of IsCrossing in parallel. Then, the algorithm has a BuildCopyAndRecurseMatrix subcircuit followed by ξ subcircuits that compute range search queries on subtrees of height (d−1). This yields a circuit depth of d·
+dO(·ξlog r)=O(
·log n).
Thus, with the copy-and-recurse based PPRangeSearch algorithm, e.g., Algorithm 3 in
Since more complex ranges mean a higher t value, and since in practice comparing a range to such objects is the dominant part of the running time, in order to achieve a more efficient operation, it is important to reduce this number of compares. The illustrative embodiments provide an improvement, over the naive implementation of checking every point, of O(n·t), where again O(n) is a lower bound when running under FHE, and O(t·n1−1/d+ξ) is the best bound known in plaintext when allowing near linear storage.
The efficiency in the performance achieved by the illustrative embodiments comes, at least in part, from the way the partition tree is traversed, which takes advantage of the properties of partition trees with regard to each node being bound to only ξ number of children that need to be recursed into. These properties are used to recurse into ξ children, thus achieving similar results (ignoring the O(n) overhead to allow this recursion) as the plaintext algorithm.
The copy-and-recurse based PPRangeSearch algorithm may be implemented by a FHE enabled search engine to perform various types of FHE operations involving range searches. Each of these operations may have the functions ƒ and g, discussed above, set such that the corresponding operation is performed using the mechanisms of the illustrative embodiments. Examples of these operations include counting, reporting, minimum, averages and k-means clustering, and the like. The following will describe these example operations, where in each of these examples, again P is a set of points and Γ is a family of ranges with γ∈Γ, and all are in d, for some d≥1.
For example, with regard to the counting operation, this operation may be characterized as the problem of computing |P∩γ|, i.e., how many points of P are in γ. For this, the operation sets ƒ: 2^P→ℕ, which is defined as ƒ(A)=|A|, and sets g: ℕ×ℕ→ℕ, which is defined as g(a,b)=a+b.
As another example, with regard to the reporting operation, this operation may be characterized as the problem of outputting the points in P∩γ. Here the points are not reported explicitly, as that would violate the privacy requirements of FHE, but instead what is reported is O(log n) canonical subsets Sv1, . . . ,Svm such that ∪iSvi=P∩γ. The canonical subsets are the sets associated with nodes in the partition tree. To report them, the operation assigns an identifier to each node and outputs the identifier of the node. For this, the operation sets ƒ: 2^P→2^ℕ, that is, ƒ maps a set A⊂P into the set of identifiers of canonical subsets whose union is A. In this case, ƒ is defined as ƒ(Sv)=ID(Sv), where ID(·) is a function returning a unique identifier for each subset Sv associated with a node. Similarly, g is set to be g: 2^ℕ×2^ℕ→2^ℕ and is defined as g(A,B)=A∪B.
In another example, with regard to the minimum, or “Min”, operation, this operation may be characterized as reporting min_{p∈P∩γ}(cost(p)), where cost: P→ℝ is some cost function. To report the minimum, the operation sets ƒ: 2^P→ℝ, defines ƒ(A)=min_{p∈A}(cost(p)), and defines g(a,b)=min(a,b). It should be noted that min under FHE is costly to compute; however, using the partition tree implementation of PPRangeSearch of the illustrative embodiments yields only O(log n) calls to min, as opposed to O(n) calls using the naive approach of checking every point in P.
In still another example, with regard to averages, the average of a set A is
Since division is costly under FHE, to compute an average, the operation sets ƒ: 2^P→P×R and defines ƒ(A)=(SumA,SizeA), where SumA=Σ_{a∈A} a and SizeA=|A|. The average can then be computed as Avg(A)=SumA/SizeA.
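The decomposition property ƒ(A∪B)=g(ƒ(A),ƒ(B)) that the counting, minimum, and average operations share can be sketched in plaintext as follows. The `aggregate` helper and the sample subsets are hypothetical, and the cost function for the minimum is taken to be the identity for simplicity.

```python
from functools import reduce
from typing import Callable, Iterable, List, TypeVar

D = TypeVar("D")


def aggregate(parts: Iterable[List[float]],
              f: Callable[[List[float]], D],
              g: Callable[[D, D], D]) -> D:
    """Apply f to each disjoint part and join the partial results with g."""
    return reduce(g, (f(part) for part in parts))


# Disjoint canonical subsets whose union plays the role of P ∩ γ.
parts = [[4.0, 5.0], [6.0], [2.0, 3.0, 7.0]]

count = aggregate(parts, len, lambda a, b: a + b)
minimum = aggregate(parts, min, min)
total, size = aggregate(parts,
                        lambda A: (sum(A), len(A)),
                        lambda a, b: (a[0] + b[0], a[1] + b[1]))
average = total / size  # one division at the end (assumed to act on the decrypted pair)

print(count, minimum, average)  # prints 6 2.0 4.5
```

Carrying the pair (SumA, SizeA) through g keeps every intermediate step to additions, so the only division happens once on the final pair.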
With regard to k-means clustering as another example, for a fixed k, a k-means clustering can be computed by randomly picking k “center” points c1, . . . , ck and then repeating the following operations:
whose geometric shape is a polytope with k−1 faces. Then, the new center is set to be ci:=Avg(P∩γci).
These are only examples of computer operations that may be performed using FHE and the partition tree based range search mechanisms of the illustrative embodiments. Other computer operations that may be represented as range searches may also make use of the improved computing tool and improved computing tool functionality/operations of the illustrative embodiments without departing from the spirit and scope of the present invention.
Thus, the illustrative embodiments provide an improved computing tool and improved computing tool functionality/operations that operate to efficiently perform fully homomorphic encryption (FHE) based operations using partition trees and range searching mechanisms to avoid having to evaluate all data points or records of a database and instead focus on specific ranges, or portions of the partition tree, that are completely within a given range or intersect a given range of the FHE based operation. This makes the operation more efficient in that fewer data points or records need to be evaluated.
The illustrative embodiments described above implement partition tree generation logic, partition tree range search logic, copy-and-recurse logic, compact logic, and the like, to provide an improved computer functionality for performing FHE operations.
As shown in
An encrypted query may then be received from a client computing device, e.g., an application executing on a separate computing device, such as clients 210, 212 (step 930). It is assumed for this description that the encrypted query is one that can be represented as a range search under FHE. The encrypted query is encrypted with the public key of the client and can then be decrypted by the cloud service provider to identify the operation being requested.
The decrypted query is used to generate the ranges and parameters for performing a partition tree range search, such as identifying the input parameters to the partition tree range search algorithm, e.g., PPRangeSearch of
Again, it should be noted that while the illustrative embodiments are described with reference to partition trees and range search operations, the illustrative embodiments are not limited to partition trees and range searches. To the contrary, the copy-and-recurse operations and computing tool may operate on other types of tree data structures and perform other operations as noted above, that are not limited to range searches. For example, the illustrative embodiments may operate on decision tree or search tree data structures. In a decision tree there is a bound ξ=1 on the number of children that need to be recursed into. This leads to a circuit of size O(n+t·nϵ), where t is the size of the circuit that evaluates the condition at a node. In r-ary search trees (for example B-trees) there is a bound ξ=1 on the number of children (with r depending on the parameter of the B-tree). This leads to a circuit of size O(n+t·nϵ), where t is the size of the circuit that evaluates the comparison.
As is apparent from the detailed description above, the present invention may be a specifically configured computing system, configured with hardware and/or software that is itself specifically configured to implement the particular mechanisms and functionality described herein, a method implemented by the specifically configured computing system, and/or a computer program product comprising software logic that is loaded into a computing system to specifically configure the computing system to implement the mechanisms and functionality described herein. Whether recited as a system, method, or computer program product, it should be appreciated that the illustrative embodiments described herein are specifically directed to an improved computing tool and the methodology implemented by this improved computing tool. In particular, the improved computing tool of the illustrative embodiments specifically provides an FHE enabled search engine that implements logic for efficiently performing range searches of partition trees using a copy-and-recurse functionality that efficiently identifies portions of a data set to which to apply FHE operations, then performs those FHE operations on the identified portions, and returns encrypted results. The improved computing tool implements mechanisms and functionality, such as the FHE enabled search engine 240 of cloud computing service 230 in
Computer 1001 may take the form of a desktop computer, laptop computer, tablet computer, smart phone, smart watch or other wearable computer, mainframe computer, quantum computer or any other form of computer or mobile device now known or to be developed in the future that is capable of running a program, accessing a network or querying a database, such as remote database 1030. As is well understood in the art of computer technology, and depending upon the technology, performance of a computer-implemented method may be distributed among multiple computers and/or between multiple locations. On the other hand, in this presentation of computing environment 1000, detailed discussion is focused on a single computer, specifically computer 1001, to keep the presentation as simple as possible. Computer 1001 may be located in a cloud, even though it is not shown in a cloud in
Processor set 1010 includes one, or more, computer processors of any type now known or to be developed in the future. Processing circuitry 1020 may be distributed over multiple packages, for example, multiple, coordinated integrated circuit chips. Processing circuitry 1020 may implement multiple processor threads and/or multiple processor cores. Cache 1021 is memory that is located in the processor chip package(s) and is typically used for data or code that should be available for rapid access by the threads or cores running on processor set 1010. Cache memories are typically organized into multiple levels depending upon relative proximity to the processing circuitry. Alternatively, some, or all, of the cache for the processor set may be located “off chip.” In some computing environments, processor set 1010 may be designed for working with qubits and performing quantum computing.
Computer readable program instructions are typically loaded onto computer 1001 to cause a series of operational steps to be performed by processor set 1010 of computer 1001 and thereby effect a computer-implemented method, such that the instructions thus executed will instantiate the methods specified in flowcharts and/or narrative descriptions of computer-implemented methods included in this document (collectively referred to as “the inventive methods”). These computer readable program instructions are stored in various types of computer readable storage media, such as cache 1021 and the other storage media discussed below. The program instructions, and associated data, are accessed by processor set 1010 to control and direct performance of the inventive methods. In computing environment 1000, at least some of the instructions for performing the inventive methods may be stored in block 200 in persistent storage 1013.
Communication fabric 1011 is the signal conduction paths that allow the various components of computer 1001 to communicate with each other. Typically, this fabric is made of switches and electrically conductive paths, such as the switches and electrically conductive paths that make up busses, bridges, physical input/output ports and the like. Other types of signal communication paths may be used, such as fiber optic communication paths and/or wireless communication paths.
Volatile memory 1012 is any type of volatile memory now known or to be developed in the future. Examples include dynamic type random access memory (RAM) or static type RAM. Typically, the volatile memory is characterized by random access, but this is not required unless affirmatively indicated. In computer 1001, the volatile memory 1012 is located in a single package and is internal to computer 1001, but, alternatively or additionally, the volatile memory may be distributed over multiple packages and/or located externally with respect to computer 1001.
Persistent storage 1013 is any form of non-volatile storage for computers that is now known or to be developed in the future. The non-volatility of this storage means that the stored data is maintained regardless of whether power is being supplied to computer 1001 and/or directly to persistent storage 1013. Persistent storage 1013 may be a read only memory (ROM), but typically at least a portion of the persistent storage allows writing of data, deletion of data and re-writing of data. Some familiar forms of persistent storage include magnetic disks and solid state storage devices. Operating system 1022 may take several forms, such as various known proprietary operating systems or open source Portable Operating System Interface type operating systems that employ a kernel. The code included in block 230 typically includes at least some of the computer code involved in performing the inventive methods.
Peripheral device set 1014 includes the set of peripheral devices of computer 1001. Data communication connections between the peripheral devices and the other components of computer 1001 may be implemented in various ways, such as Bluetooth connections, Near-Field Communication (NFC) connections, connections made by cables (such as universal serial bus (USB) type cables), insertion type connections (for example, secure digital (SD) card), connections made through local area communication networks and even connections made through wide area networks such as the internet. In various embodiments, UI device set 1023 may include components such as a display screen, speaker, microphone, wearable devices (such as goggles and smart watches), keyboard, mouse, printer, touchpad, game controllers, and haptic devices. Storage 1024 is external storage, such as an external hard drive, or insertable storage, such as an SD card. Storage 1024 may be persistent and/or volatile. In some embodiments, storage 1024 may take the form of a quantum computing storage device for storing data in the form of qubits. In embodiments where computer 1001 is required to have a large amount of storage (for example, where computer 1001 locally stores and manages a large database) then this storage may be provided by peripheral storage devices designed for storing very large amounts of data, such as a storage area network (SAN) that is shared by multiple, geographically distributed computers. IoT sensor set 1025 is made up of sensors that can be used in Internet of Things applications. For example, one sensor may be a thermometer and another sensor may be a motion detector.
Network module 1015 is the collection of computer software, hardware, and firmware that allows computer 1001 to communicate with other computers through WAN 1002. Network module 1015 may include hardware, such as modems or Wi-Fi signal transceivers, software for packetizing and/or de-packetizing data for communication network transmission, and/or web browser software for communicating data over the internet. In some embodiments, network control functions and network forwarding functions of network module 1015 are performed on the same physical hardware device. In other embodiments (for example, embodiments that utilize software-defined networking (SDN)), the control functions and the forwarding functions of network module 1015 are performed on physically separate devices, such that the control functions manage several different network hardware devices. Computer readable program instructions for performing the inventive methods can typically be downloaded to computer 1001 from an external computer or external storage device through a network adapter card or network interface included in network module 1015.
WAN 1002 is any wide area network (for example, the internet) capable of communicating computer data over non-local distances by any technology for communicating computer data, now known or to be developed in the future. In some embodiments, the WAN may be replaced and/or supplemented by local area networks (LANs) designed to communicate data between devices located in a local area, such as a Wi-Fi network. The WAN and/or LANs typically include computer hardware such as copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and edge servers.
End user device (EUD) 1003 is any computer system that is used and controlled by an end user (for example, a customer of an enterprise that operates computer 1001), and may take any of the forms discussed above in connection with computer 1001. EUD 1003 typically receives helpful and useful data from the operations of computer 1001. For example, in a hypothetical case where computer 1001 is designed to provide a recommendation to an end user, this recommendation would typically be communicated from network module 1015 of computer 1001 through WAN 1002 to EUD 1003. In this way, EUD 1003 can display, or otherwise present, the recommendation to an end user. In some embodiments, EUD 1003 may be a client device, such as a thin client, heavy client, mainframe computer, desktop computer and so on.
Remote server 1004 is any computer system that serves at least some data and/or functionality to computer 1001. Remote server 1004 may be controlled and used by the same entity that operates computer 1001. Remote server 1004 represents the machine(s) that collect and store helpful and useful data for use by other computers, such as computer 1001. For example, in a hypothetical case where computer 1001 is designed and programmed to provide a recommendation based on historical data, then this historical data may be provided to computer 1001 from remote database 1030 of remote server 1004.
Public cloud 1005 is any computer system available for use by multiple entities that provides on-demand availability of computer system resources and/or other computer capabilities, especially data storage (cloud storage) and computing power, without direct active management by the user. Cloud computing typically leverages sharing of resources to achieve coherence and economies of scale. The direct and active management of the computing resources of public cloud 1005 is performed by the computer hardware and/or software of cloud orchestration module 1041. The computing resources provided by public cloud 1005 are typically implemented by virtual computing environments that run on various computers making up the computers of host physical machine set 1042, which is the universe of physical computers in and/or available to public cloud 1005. The virtual computing environments (VCEs) typically take the form of virtual machines from virtual machine set 1043 and/or containers from container set 1044. It is understood that these VCEs may be stored as images and may be transferred among and between the various physical machine hosts, either as images or after instantiation of the VCE. Cloud orchestration module 1041 manages the transfer and storage of images, deploys new instantiations of VCEs and manages active instantiations of VCE deployments. Gateway 1040 is the collection of computer software, hardware, and firmware that allows public cloud 1005 to communicate through WAN 1002.
Some further explanation of virtualized computing environments (VCEs) will now be provided. VCEs can be stored as “images.” A new active instance of the VCE can be instantiated from the image. Two familiar types of VCEs are virtual machines and containers. A container is a VCE that uses operating-system-level virtualization. This refers to an operating system feature in which the kernel allows the existence of multiple isolated user-space instances, called containers. These isolated user-space instances typically behave as real computers from the point of view of programs running in them. A computer program running on an ordinary operating system can utilize all resources of that computer, such as connected devices, files and folders, network shares, CPU power, and quantifiable hardware capabilities. However, programs running inside a container can only use the contents of the container and devices assigned to the container, a feature which is known as containerization.
Private cloud 1006 is similar to public cloud 1005, except that the computing resources are only available for use by a single enterprise. While private cloud 1006 is depicted as being in communication with WAN 1002, in other embodiments a private cloud may be disconnected from the internet entirely and only accessible through a local/private network. A hybrid cloud is a composition of multiple clouds of different types (for example, private, community or public cloud types), often respectively implemented by different vendors. Each of the multiple clouds remains a separate and discrete entity, but the larger hybrid cloud architecture is bound together by standardized or proprietary technology that enables orchestration, management, and/or data/application portability between the multiple constituent clouds. In this embodiment, public cloud 1005 and private cloud 1006 are both part of a larger hybrid cloud.
As shown in
It should be appreciated that once the computing device is configured in one of these ways, the computing device becomes a specialized computing device specifically configured to implement the mechanisms of the illustrative embodiments and is not a general purpose computing device. Moreover, as described hereafter, the implementation of the mechanisms of the illustrative embodiments improves the functionality of the computing device and provides a useful and concrete result as previously noted above.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.