1. Field of the Invention
The present invention is directed to query verification of untrusted servers, and more specifically to a query verifier that uses a synopsis to verify query results.
2. Brief Discussion of Related Art
Due to the overwhelming flow of information in many data stream applications, data outsourcing is a natural and effective paradigm for individual businesses to address the issue of scale. In conventional data outsourcing models, the data owner outsources streaming data to one or more third-party servers, which answer queries posed by a potentially large number of clients on the data owner's behalf. Data outsourcing intrinsically raises issues of trust. Conventional approaches to query verification build upon cryptographic primitives, such as signatures and collision-resistant hash functions, which typically only work for certain types of queries, such as simple selection/aggregation queries.
Conventional industrial and academic Data Stream Management Systems (DSMS) have been developed in recent years. The need for such DSMSs is mainly driven by the continuous nature of the data being generated by a variety of real-world applications, such as telephony and networking. Providing fast and reliable querying services on the streaming data to clients is central to many businesses. However, due to the overwhelming data flow observed in most data streams, companies typically do not possess the necessary resources for deploying a DSMS. In these cases, outsourcing the data stream and the desired computations to a third-party server can be the only alternative. Outsourcing also solves the issue of scale. That is, as the number of clients increases, the number of mirroring servers employed by the data owner can be increased. In addition, this can often lead to faster query responses, since these servers can be closer to the clients than a single centralized server. However, because data outsourcing and remote computations raise issues of trust, outsourced query verification on data streams is a problem with important practical implications.
For example, a data owner with limited resources, such as memory and bandwidth, may outsource its data stream to one or more remote, untrusted servers that can be compromised, malicious, running faulty software, etc. A client registers a continuous query on the DSMS of the server and receives results upon request. Assuming that the server charges the data owner according to the computation resources consumed and the volume of traffic processed for answering the queries, the server then has an incentive to deceive the owner and the client for increased profit. Furthermore, the server might have a competing interest to provide fraudulent answers to a particular client. Hence, a passive malicious server could drop query results or provide random answers in order to reduce the computation resources required for answering queries, while a compromised or active malicious server might be willing to spend additional computational resources to provide fraudulent results (by altering, dropping, or introducing spurious answers). In other cases, incorrect answers might simply be a result of faulty software, or due to load shedding strategies, which are essential tools for dealing with bursty streaming data.
Ideally, the data owner and the client should be able to verify the integrity of the computation performed by the server using significantly fewer resources than having the query answered directly, i.e., where the data owner evaluates the query locally and then transmits the entire query result to the client. If a client wants to verify the query results with absolute confidence, the only solution is for the data owner to evaluate the query exactly and transmit the entire result to the client, which obviates the need of outsourcing.
Further, the client should have the capability to tolerate errors caused by load shedding algorithms or other non-malicious operations, while at the same time being able to identify mal-intended attacks which have a significant impact on the result.
Embodiments of the present invention are directed to a method, a medium, and a computing system for verifying a query result of an untrusted server. A data stream is outsourced to the untrusted server, which is configured to respond to a query with the query result. A verification synopsis, which in some embodiments uses at most 3 words of memory, is generated using at least a portion of the query result and a seed. The verification synopsis includes a polynomial, where coefficients of the polynomial are determined based on the seed. The verification synopsis and the seed are output to a client for verification of the query result. Verification of the query result can be performed in a single pass of the query result. A result synopsis can be computed using the query result and the result synopsis is compared with the verification synopsis to verify the query result.
In some embodiments, the seed can be required by the client to use the verification synopsis and can have a value that is undisclosed by the data owner until the seed is output to the client. An alarm can be raised when the verification synopsis and the query results do not match and/or the number of errors between the verification synopsis and the query results exceeds a threshold. In some embodiments, errors in the query result can be located and corrected using the verification synopsis and/or the number of errors in the query result can be estimated using the verification synopsis.
In some embodiments, a vector can be maintained corresponding to the query result and the verification synopsis can be computed using the vector when a request for verification is received.
In some embodiments, layers can be generated, where each layer includes buckets. One of the buckets can be represented by the verification synopsis and elements of the vector are assigned to one bucket per layer.
Other objects and features of the present invention will become apparent from the following detailed description considered in conjunction with the accompanying drawings. It is to be understood, however, that the drawings are designed as an illustration only and not as a definition of the limits of the invention.
Exemplary embodiments of the present invention are illustrated using “GROUP BY, COUNT” and “GROUP BY, SUM” queries known to those skilled in the art, although it will be recognized that other queries can be used. In network monitoring applications, a computation of the total number of packets originated and destined to certain IP addresses is often desired. The “GROUP BY, COUNT” query is substantially equivalent to computing the frequencies (e.g., occurrences) of tuples in a stream. Most streaming algorithms deal with either the frequencies directly or their relatives, such as frequency moments, heavy hitters, quantiles, and inverse frequencies, which focus on computing the answers to these queries, but not their verification.
Embodiments of the present invention include a query verifier that uses algebraic and probabilistic techniques to compute a small synopsis on the true query result, which can be communicated to the client for verification of whether the query results returned by the server are correct. A synopsis is a small summary based on a query result that is maintained and updated by the data owner and includes a polynomial with coefficients that are based on query results and a random seed. As used herein, a “verification synopsis” refers to a synopsis generated by the data owner using a vector based on query results and a “result synopsis” refers to a synopsis generated by a client using query results. A polynomial is a mathematical expression having constants and variables forming terms, each term consisting of a constant multiplier and one or more variables raised to a power, where the terms are separated by mathematical operations. A seed is a random number defined by the data owner that is required to perform the verification of query results using the synopsis. Embodiments of the present invention advantageously provide high-confidence probabilistic solutions with arbitrarily minuscule probability of error, and develop verification algorithms that incur minimal resources, in terms of both the memory consumption of the data owner and the data owner-client network bandwidth.
To ensure that the untrusted server 110 is providing valid results, the data owner 130 includes a query verifier 132, which generates and maintains a randomized synopsis 134 of the query results. The query verifier 132 generates the randomized synopsis 134, which is referred to herein as a “polynomial identity random synopsis” (PIRS), to raise an alarm with very high confidence if errors in the query results 116 exist. Specifically, the data owner 130 preferably maintains the synopsis 134 at a constant size of three machine words (e.g., 3 bytes), and transmits the synopsis 134 to the client 120 preferably via a secure channel upon a verification request 122. The query result 116 can be verified using this small synopsis 134. The synopsis 134 can be maintained using constant space and low processing cost per tuple in the stream (O(1) for count queries and O(log n) or O(log μ) for sum queries, where n is the number of possible groups, μ is the update amount per tuple, and O denotes an asymptotic upper bound). In some embodiments, the synopsis 134 can be used for verifying multiple simultaneous queries with the same aggregate attribute but different group-by partitioning, where the size of the synopsis 134 is the same as that for verifying a single query.
In some embodiments, the query verifier 132 can implement a buffer 136 for data streams that exhibit a large degree of locality. The buffer 136 can be used to store exact aggregate results for a small number of groups. With data locality, a large portion of updates hit the buffer 136. Whenever the buffer 136 is full and a new group needs to be inserted, a victim is evicted from the buffer using a simple least recently used (LRU) policy. Only then does the evicted group update PIRS, using the overall aggregate value computed within the buffer 136. The buffer 136 is emptied to update PIRS whenever verification is required.
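The buffering scheme above can be sketched as follows. This is a minimal illustration, not the embodiment itself: the class name `BufferedPirs1`, the modulus `P`, and the use of the PIRS-1 product form for the underlying synopsis are assumptions made for the example, and update amounts are assumed to be non-negative.

```python
from collections import OrderedDict

P = (1 << 61) - 1  # assumed prime modulus (the Mersenne prime 2^61 - 1)


class BufferedPirs1:
    """PIRS-1 value fronted by a small LRU buffer of exact group aggregates."""

    def __init__(self, alpha, capacity):
        self.alpha = alpha            # secret seed, assumed drawn from Z_p
        self.capacity = capacity      # maximum number of buffered groups
        self.x = 1                    # synopsis value, X(0) = 1
        self.buffer = OrderedDict()   # group id -> aggregate not yet in PIRS

    def _push(self, group, amount):
        # Fold one aggregate into the synopsis: X *= (alpha - group)^amount.
        self.x = self.x * pow((self.alpha - group) % P, amount, P) % P

    def update(self, group, amount=1):
        if group in self.buffer:
            self.buffer[group] += amount
            self.buffer.move_to_end(group)   # mark as most recently used
            return
        if len(self.buffer) >= self.capacity:
            victim, total = self.buffer.popitem(last=False)  # evict LRU group
            self._push(victim, total)
        self.buffer[group] = amount

    def flush(self):
        # Empty the buffer into PIRS before serving a verification request.
        while self.buffer:
            group, total = self.buffer.popitem(last=False)
            self._push(group, total)
        return self.x


updates = [(1, 2), (2, 1), (1, 3), (3, 4), (2, 5), (4, 1)]
buffered = BufferedPirs1(alpha=123456789, capacity=2)
for g, u in updates:
    buffered.update(g, u)
buffered_value = buffered.flush()

# Reference value computed without any buffer.
direct_value = 1
for g, u in updates:
    direct_value = direct_value * pow((123456789 - g) % P, u, P) % P
```

Because the synopsis is a product of per-group factors, the order in which buffered aggregates are pushed does not change the final value, so the buffered and unbuffered computations agree.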
Embodiments of the query verifier 132 can use generalizations of the synopsis 134, using the basic PIRS as a building block to create different verification schemes. In some embodiments, a generalization of PIRS is generated for raising alarms when the number of errors exceeds a predefined threshold γ. This synopsis, referred to herein as PIRSγ, allows the server 110 to return query results having a predefined margin of error, which can result from, for example, using semantic load shedding. When the number of errors is less than the predefined threshold γ, the server is still considered to be trustworthy.
In some embodiments, a weaker version of PIRSγ, referred to herein as PIRS±γ, can be generated for allowing the server 110 to return query results having an approximate predefined margin of error, where if the errors are fewer than γ− the answer is considered valid, while if the errors are more than γ+ the answer is considered invalid.
In some embodiments, a strengthened version of PIRSγ, referred to herein as PIRSγ*, can be generated so that when the number of errors is tolerable, the errors can be located and even corrected. When PIRSγ* is implemented, it can also act as an error-correcting code, which can guarantee that the complete and correct query results can be delivered to the client 120 when the number of errors is less than the predefined threshold γ.
In some embodiments, an embodiment of the synopsis 134, referred to herein as FM-PIRS, can be used to estimate the actual number of errors accurately. FM-PIRS is a parameter-free version of PIRS±γ in the sense that it does not depend on the predefined threshold γ. In particular, when the number of errors exceeds the predefined threshold γ, PIRS±γ raises an alarm, while FM-PIRS also reports an estimate of the actual number of errors. FM-PIRS has a smaller size than PIRS±γ for large enough threshold γ.
Exemplary queries used to illustrate the query verifier 132 can have the following structure:
GROUP BY aggregate queries can have wide applications in monitoring and statistical analysis of data streams (e.g., in networking and telephony applications). An example of such a query that appears frequently in network monitoring applications is the following:
The above query (*) is used for illustrating exemplary embodiments, where sum and count are used, and is referred to herein as the “illustrative query”. Other aggregates that can be converted to sum and count, such as average, standard deviation, and the like, can be supported, by verifying each component separately (i.e., verifying the sum and the count in the case of average).
The “GROUP BY” predicate partitions streaming tuples into a set of n groups, computing one sum per group. The data stream can be viewed as a sequence of additions (or subtractions) over a set of items in $[n]=\{1, \ldots, n\}$. This data stream is denoted herein as S, and its τ-th tuple is denoted as $s^\tau=(i, u^\tau)$, an update of amount $u^\tau$ to the i-th group. The query answer can be expressed as a dynamic vector of non-negative integers $v^\tau=[v_1^\tau, \ldots, v_n^\tau] \in \mathbb{N}^n$ containing one component per group aggregate. Initially, $v^0$ is the zero vector. A new tuple $s^\tau=(i, u^\tau)$ updates the corresponding group i in $v^\tau$ as $v_i^\tau=v_i^{\tau-1}+u^\tau$. The amount $u^\tau$ can be either positive or negative, but $v_i^\tau \ge 0$ is required for all τ and i. When count queries are concerned, $u^\tau$ is substantially equal to 1 for all τ. An assumption can be made that the $L_1$ norm of $v^\tau$ is always bounded by some large m, i.e., at any τ, $\|v^\tau\|_1=\sum_{i=1}^{n} v_i^\tau \le m$.
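The stream model can be illustrated with a short sketch that replays (group, amount) tuples into the aggregate vector. The function name `apply_stream` and the sparse dictionary representation are assumptions chosen for the example.

```python
from collections import defaultdict


def apply_stream(stream):
    """Replay (group, amount) tuples and return the aggregate vector v.

    Only nonzero groups are stored, so memory grows with the number of
    active groups rather than with the size n of the group space.
    """
    v = defaultdict(int)
    for group, amount in stream:
        v[group] += amount              # v_i^tau = v_i^(tau-1) + u^tau
        if v[group] < 0:
            raise ValueError(f"group {group} would become negative")
    return {g: total for g, total in v.items() if total != 0}


# Sum query: three updates touching two groups.
sum_vector = apply_stream([(1, 3), (2, 1), (1, -1)])
# Count query: every tuple carries an update amount of 1.
count_vector = apply_stream([(g, 1) for g in (4, 4, 9)])
```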
The following definition captures the semantics of Continuous Query Verification (CQV):
Formal Definition 1. Given a data stream S, a continuous query Q, and a user-defined parameter δ, a synopsis X of vector v is designed such that for any τ, given any $w^\tau$ and using $X(v^\tau)$, an alarm is raised with probability at least 1−δ if $w^\tau \ne v^\tau$, and an alarm is not raised if $w^\tau = v^\tau$.
In some embodiments wτ can be, for example, an answer provided by the server 110, while X(vτ) can be the synopsis 134 communicated to the client 120 from the data owner 130 for verifying vector vτ.
Using the above definition, an alarm can be raised with high probability if any component (or group answer) $v_i^\tau$ is inconsistent. As an example, consider a server that is using semantic load shedding, i.e., dropping tuples from certain groups, on bursty stream updates. In this example, the aggregate of a certain, small number of components will be inconsistent without malicious intent. In some embodiments, a certain degree of tolerance in the number of erroneous answers contained in the query results is allowed, rather than raising alarms indiscriminately. The following definition captures the semantics of Continuous Query Verification with Tolerance for a Limited Number of Errors (CQVγ):
Formal Definition 2. For any n-component vectors w and v, let $E(w, v)=\{i \mid w_i \ne v_i\}$. Then define $w \ne_\gamma v$ if and only if $|E(w, v)| \ge \gamma$, and $w =_\gamma v$ if and only if $|E(w, v)| < \gamma$. Given a data stream S, a continuous query Q, and user-defined parameters γ and δ, a synopsis X of vector v is designed such that, for any τ, given any $w^\tau$ and using $X(v^\tau)$, an alarm is raised with probability at least 1−δ if $w^\tau \ne_\gamma v^\tau$, and an alarm is not raised if $w^\tau =_\gamma v^\tau$.
Formal definition 1 is a special case of formal definition 2, where the threshold γ=1.
Some embodiments can support random load shedding (i.e., can tolerate small absolute or relative errors on any component irrespective of the total number of inconsistent components). The following definition captures the semantics of Continuous Query Verification with Tolerance for Small Errors (CQVη).
Formal Definition 3. For any n-component vectors w and v, let $w \ne_\eta v$ if and only if there is some i such that $|w_i-v_i|>\eta$, and $w \approx_\eta v$ if and only if $|w_i-v_i| \le \eta$ for all $i \in [n]$. Given a data stream S, a continuous query Q, and user-defined parameters η and δ, a synopsis X of vector v can be designed such that, for any τ, given any $w^\tau$ and using $X(v^\tau)$, an alarm is raised with probability at least 1−δ if $w^\tau \ne_\eta v^\tau$, and an alarm is not raised if $w^\tau \approx_\eta v^\tau$.
The formal definition 3 requires the absolute errors for each viτ to be no larger than η. It is also possible to use relative errors (i.e., raise an alarm if and only if there is some i such that |wiτ−viτ|/|viτ|>η). Thus, formal definition 1 is also a special case of formal definition 3 with η=0.
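The three notions of agreement between an answer w and the true vector v can be made concrete with exact reference predicates. These functions need the full vector v and therefore only state what a synopsis is meant to decide; they are not themselves a synopsis. The function names and the sparse dictionary representation (missing groups count as zero) are assumptions of this sketch.

```python
def error_set(w, v):
    """E(w, v): the groups on which the answer w disagrees with v."""
    groups = set(w) | set(v)
    return {i for i in groups if w.get(i, 0) != v.get(i, 0)}


def equal_up_to_gamma(w, v, gamma):
    """w =_gamma v: fewer than gamma components are inconsistent."""
    return len(error_set(w, v)) < gamma


def equal_up_to_eta(w, v, eta):
    """w ~_eta v: every component differs by at most eta in absolute value."""
    groups = set(w) | set(v)
    return all(abs(w.get(i, 0) - v.get(i, 0)) <= eta for i in groups)


v_true = {1: 5, 2: 3}
w_answer = {1: 5, 2: 4}   # one group is off by one
```

With γ=1 the first predicate reduces to exact equality (Definition 1), and with η=0 the second does as well.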
Those skilled in the art will recognize that further variations and cases can also be defined. For example, one may wish to bound the sum of the absolute errors, or bound both the number and the size of the errors. Under a standard RAM model, used for illustrative purposes, it is assumed that addition, subtraction, multiplication, division, or modular arithmetic operations involving two words take one unit of time. It is also assumed that n/δ and m/δ fit in a word.
Embodiments of the Polynomial Identity Random-Synopsis (PIRS) can be denoted by X(v). The synopsis 134 is based on testing the identity of polynomials by evaluating the polynomials at a randomly chosen point. The technique of verifying polynomial identities is well-known to those skilled in the art. PIRS can have two variants, named PIRS-1 and PIRS-2, respectively.
For PIRS-1, let p be some prime such that $\max\{m/\delta, n\}<p$. For the space analysis, let $p \le 2\max\{m/\delta, n\}$. PIRS-1 uses the field $\mathbb{Z}_p$, where additions, subtractions, and multiplications are done modulo p. A seed α is chosen from $\mathbb{Z}_p$ uniformly at random, and $X(v)$ is computed incrementally from $X(v^{\tau-1})$ and $s^\tau=(i, u^\tau)$ as:
$X(v^\tau)=X(v^{\tau-1})\cdot(\alpha-i)^{u^\tau}$
Equivalently, in closed form:
$X(v)=(\alpha-1)^{v_1}\cdot(\alpha-2)^{v_2}\cdots(\alpha-n)^{v_n}$
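The values of n, m, δ, p are known to the data owner 130 and the clients 120. The data owner 130 secretly picks a seed α and maintains X(v). Upon a verification request 122, the data owner 130 returns the synopsis 134 (e.g., PIRS) to the client 120, which can include only two words: seed α and X(v). Given any answer w (i.e., query results 116) returned by the server 110, the client 120 can use PIRS to check if w=v with high probability, by computing
$X(w)=(\alpha-1)^{w_1}\cdot(\alpha-2)^{w_2}\cdots(\alpha-n)^{w_n}$
If $\sum_{i=1}^{n} w_i>m$, the answer w is rejected immediately as being erroneous. If X(w)=X(v), then it is declared that w=v. Otherwise, an alarm is raised. Using this approach, a false alarm is never raised, and it can be shown that a true alarm will be missed with probability at most δ; that is, given any w≠v, PIRS raises an alarm with probability at least 1−δ. To illustrate this, consider the polynomials $f_v(x)=(x-1)^{v_1}\cdots(x-n)^{v_n}$ and $f_w(x)=(x-1)^{w_1}\cdots(x-n)^{w_n}$. Because n<p, the values 1, . . . , n are distinct in $\mathbb{Z}_p$, so by unique factorization $f_v \ne f_w$ as polynomials whenever w≠v. Both polynomials have degree at most m, so $f_v-f_w$ is a nonzero polynomial of degree at most m and has at most m roots in $\mathbb{Z}_p$. An alarm is missed only if the randomly chosen seed α is one of these roots, which happens with probability at most m/p≤δ.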
The update time to maintain X(v) as new updates are observed can be determined as follows. For count queries, each tuple increments one of the $v_i$'s by one, so the update cost is constant (one subtraction and one multiplication). For sum queries, a tuple s=(i, u) increases $v_i$ by u, so $(\alpha-i)^u$ is computed, which can be done in O(log u) time (exponentiation by repeated squaring). To perform a verification with the answer w, $(\alpha-i)^{w_i}$ is computed for each nonzero entry $w_i$ of the answer w, which takes $O(\log w_i)$ time, so the time needed for a verification is $O(\sum_{i: w_i \ne 0} \log w_i)$.
Since both X(v) and seed α are smaller than p, the space complexity of the synopsis is $O(\log(m/\delta)+\log n)$ bits.
PIRS-1 occupies $O(\log(m/\delta)+\log n)$ bits of space, spends O(1) (resp. O(log u)) time to process a tuple for count (resp. sum) queries, and $O(\sum_{i: w_i \ne 0} \log w_i)$ time to perform a verification.
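The PIRS-1 procedure described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the modulus `P`, the class name `Pirs1`, the helper names, and the early rejection of answers whose total exceeds m are choices made for the example, and group identifiers are assumed to be smaller than `P`.

```python
import secrets

P = (1 << 61) - 1  # assumed prime p; must exceed max(m / delta, n)


class Pirs1:
    """Data-owner side: maintains X(v) = prod_i (alpha - i)^(v_i) mod p."""

    def __init__(self, alpha=None):
        # The seed stays secret until a verification request arrives.
        self.alpha = secrets.randbelow(P) if alpha is None else alpha
        self.x = 1

    def update(self, group, amount=1):
        # One multiplication for count queries; repeated squaring for sums.
        self.x = self.x * pow((self.alpha - group) % P, amount, P) % P


def pirs1_of(answer, alpha):
    """Client side: compute X(w) in one pass over the nonzero entries of w."""
    x = 1
    for group, value in answer.items():
        if value:
            x = x * pow((alpha - group) % P, value, P) % P
    return x


def verify(answer, alpha, x_v, m):
    """Return True when the answer is accepted, False when an alarm is raised."""
    if sum(answer.values()) > m:     # assumed sanity check on the L1 norm
        return False
    return pirs1_of(answer, alpha) == x_v


owner = Pirs1(alpha=987654321)       # fixed seed only to make the example repeatable
for group in (3, 7, 3):              # count query: three tuples
    owner.update(group)

accepted = verify({3: 2, 7: 1}, owner.alpha, owner.x, m=100)
rejected = verify({3: 1, 7: 2}, owner.alpha, owner.x, m=100)
```

The client needs only the two words `alpha` and `owner.x`, and it never stores the answer: each entry is folded into the running product as it arrives.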
When u is negative (or when handling deletions for count queries), the field $\mathbb{Z}_p$ may not be equipped with division. In this case, $(\alpha-i)^{-1}$, which is the multiplicative inverse of (α−i) in $\mathbb{Z}_p$, is computed in O(log p) time, and then $((\alpha-i)^{-1})^{|u|}$ is computed.
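A negative update can be handled as sketched below. The function name `apply_update` and the modulus are assumptions of the example; the inverse is obtained with Python's three-argument `pow` and exponent −1 (available from Python 3.8), which raises `ValueError` if the base is 0 modulo p, i.e., in the unlikely event that the seed equals the group identifier.

```python
P = (1 << 61) - 1  # assumed prime modulus


def apply_update(x, alpha, group, amount):
    """Return the PIRS-1 value after an update of `amount` to `group`."""
    base = (alpha - group) % P
    if amount < 0:
        base = pow(base, -1, P)   # multiplicative inverse of (alpha - group)
        amount = -amount
    return x * pow(base, amount, P) % P


alpha = 55555
x = apply_update(1, alpha, group=5, amount=3)
x_after_delete = apply_update(x, alpha, group=5, amount=-1)
x_two_inserts = apply_update(1, alpha, group=5, amount=2)
x_back_to_empty = apply_update(x_after_delete, alpha, group=5, amount=-2)
```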
For PIRS-2, space usage can be improved when n<<m by choosing the prime p between max{m, n/δ} and 2max{m, n/δ}. For a seed α chosen uniformly at random from $\mathbb{Z}_p$, the following can be computed:
$X(v)=v_1\alpha+v_2\alpha^2+\cdots+v_n\alpha^n$
By adding $u\alpha^i$ in response to update s=(i, u), the above synopsis is straightforward to maintain over a stream of updates. PIRS-2 has an O(log n) update cost for both count and sum queries, since $u\alpha^i$ is computed for a tuple (i, u) in the stream. PIRS-2 occupies $O(\log m+\log(n/\delta))$
bits of space, spends O(log n) time to process a tuple, and O(|w|log n) time to perform a verification.
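A corresponding sketch for PIRS-2 follows. As before, the modulus `P`, the class name `Pirs2`, and the helper name are assumptions for illustration only.

```python
P = (1 << 61) - 1  # assumed prime p; must exceed max(m, n / delta)


class Pirs2:
    """Data-owner side: maintains X(v) = sum_i v_i * alpha^i mod p."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.x = 0

    def update(self, group, amount=1):
        # alpha^group costs O(log n) multiplications by repeated squaring.
        self.x = (self.x + amount * pow(self.alpha, group, P)) % P


def pirs2_of(answer, alpha):
    """Client side: evaluate the same polynomial on the returned answer w."""
    return sum(value * pow(alpha, group, P) for group, value in answer.items()) % P


owner2 = Pirs2(alpha=12345)
for group, amount in [(1, 4), (2, -1), (2, 3)]:   # true vector: {1: 4, 2: 2}
    owner2.update(group, amount)

matches = pirs2_of({1: 4, 2: 2}, owner2.alpha) == owner2.x
differs = pirs2_of({1: 4, 2: 3}, owner2.alpha) != owner2.x
```

Since the synopsis is additive, negative updates need no modular inverse here, in contrast to PIRS-1.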
One property of either variant of PIRS (e.g., PIRS-1 or PIRS-2) is that the verification can be performed in one pass of the answer w using a constant number of words of memory. This is advantageous when |w| is large. The client 120 can receive the answer w in a streaming fashion, verify it online, and either forward it to a dedicated server 140 (
Embodiments of the synopses solving the CQV problem with error probability of at most δ keep at least $\Omega(\log(\min\{m, n\}/\delta))$ bits.
To illustrate this, assume that the vector v and the answer w are both taken from a universe U, and let M be the set of all possible memory states the synopsis can have. Any synopsis X can be seen as a function f: U→M; and if X is randomized, it can be seen as a function randomly chosen from a family of such functions $F=\{f_1, f_2, \ldots, f_l\}$, where $f_i$ is chosen with probability $p(f_i)$. Without loss of generality, assume that $p(f_1) \ge p(f_2) \ge \cdots \ge p(f_l)$. The synopsis X needs at least log |M| bits to record the output of the function and log |F| bits to describe the function chosen randomly from F.
For any w≠v ∈ U, let $F_{w,v}=\{f \in F \mid f(w)=f(v)\}$. For a randomized synopsis X to solve CQV with error probability at most δ, the following must hold for all w≠v ∈ U:
$\sum_{f \in F_{w,v}} p(f) \le \delta$. (5)
Focusing on the first $k=\lceil\delta|F|\rceil+1$ functions $f_1, \ldots, f_k$, it can be seen that $\sum_{i=1}^{k} p(f_i)>\delta$, because these are the k most probable functions and $k/|F|>\delta$.
Since there are a total of $|M|^k$ possible combinations for the outputs of these k functions,
$|U| \le |M|^k$ (6)
so that no two w≠v ∈ U have $f_i(w)=f_i(v)$ for all i=1, . . . , k; otherwise an answer w and a vector v can be found that violate equation (5). Taking the log of both sides of equation (6), the following results:
$\log|U| \le (\lceil\delta\cdot|F|\rceil+1)\log|M|$. (7)
Since vector v has n entries whose sum is at most m, by simple combinatorics, $|U| \ge \binom{m+n}{n}$, or $\log|U| \ge \min\{m, n\}$. The following tradeoff can therefore be obtained.
|F|·log|M|=Ω(min{m, n}/δ). (8)
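If $\log|F| \le (1-\varepsilon)\log(\min\{m, n\}/\delta)$ (i.e., $|F| \le (\min\{m, n\}/\delta)^{1-\varepsilon}$) for any constant ε ∈ (0,1), then X uses super-polylogarithmic space $\log|M|=\Omega((\min\{m, n\}/\delta)^{\varepsilon})$; else X keeps $\log|F| \ge \log(\min\{m, n\}/\delta)$ random bits. Therefore, when m≦n, PIRS-1 is optimal as long as $\log n=O(\log(m/\delta))$, and when m>n, PIRS-2 is optimal as long as $\log m=O(\log(n/\delta))$. The bounds are not tight when $\log(m/\delta)=o(\log n)$ or $\log(n/\delta)=o(\log m)$.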
The analysis above focuses on bit-level space complexity. The value of p is chosen to be the maximum prime that fits in a word, so as to minimize δ, where δ=m/p for PIRS-1 and δ=n/p for PIRS-2. For example, if 64-bit words are used and $m<2^{32}$, then δ is at most $2^{-32}$ for PIRS-1, which makes any error highly unlikely (e.g., 1 in four billion). For speed consideration, careful choice of p can allow faster implementation. For example, choosing p to be a Mersenne prime (i.e., p is of the form $p=2^l-1$ for some l) allows the modulo arithmetic to be performed using simple addition and subtraction operations.
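The Mersenne-prime shortcut can be sketched as follows. Because $2^l \equiv 1 \pmod{2^l-1}$, the high and low l-bit halves of a product can simply be added. The choice l=61 and the function names are assumptions of this sketch; Python integers are arbitrary precision, so the code only illustrates the arithmetic that a fixed-width implementation would perform with shifts, masks, and additions.

```python
L_BITS = 61
P = (1 << L_BITS) - 1   # the Mersenne prime 2^61 - 1


def mod_mersenne(x):
    """Reduce 0 <= x < 2^(2 * L_BITS) modulo P without a division."""
    x = (x & P) + (x >> L_BITS)   # fold the high bits onto the low bits
    x = (x & P) + (x >> L_BITS)   # a second fold absorbs the carry
    return x - P if x >= P else x


def mul_mod(a, b):
    """Multiply two residues in [0, P) modulo P."""
    return mod_mersenne(a * b)


samples = [(0, 0), (1, P - 1), (P - 1, P - 1),
           (123456789012345678, 987654321098765432)]
results = [mul_mod(a, b) for a, b in samples]
```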
Since the group id i is extracted from each incoming tuple directly, without the use of a dictionary (which would increase the memory cost), the size of the group space, n, needs to be large for certain queries. For example, the exemplary query discussed above has a group space of $n=2^{64}$ (the combination of two IP addresses), although the actual number of nonzero entries |v| may be much less than n. In this case, since m is typically much smaller, PIRS-1 is the better choice in this example.
Embodiments of the present invention can be used for handling multiple queries simultaneously. For example, consider a number of aggregate queries on a single attribute (e.g., packet size), but with different partitioning on the input tuples (e.g., source/destination IP and source/destination port). Let $Q_1, \ldots, Q_k$ be k such queries, and let the i-th query partition the incoming tuples into $n_i$ groups for a total of $n=\sum_{i=1}^{k} n_i$ groups. In some embodiments, PIRS can be applied once per query, using space linear in k. In other embodiments, the queries can be treated as one unified query of n groups so that one PIRS is used to verify the combined vector v. The time cost for processing one update increases linearly in k, since each incoming tuple is updating k components of vector v at once (one group for every query in the worst case):
Using PIRS-1 for k queries can occupy
bits of space, spend O(k) (resp. O(k log u)) time to process a tuple for count (sum) queries, and
time to perform a verification. As a result, multiple queries can be effectively verified with a few words of memory and communication.
After an error has been detected, the client 120 can choose to disclose this information to the server 110. If the error is not reported, then the probability of detecting an error remains 1−δ. However, errors can occur due to faulty software or bad communication links, and may not be intentional. In this case, it can be beneficial to give a warning to the server 110. Since an adversary can extract knowledge from this warning (e.g., it knows at least that the same response on the same data will always fail), the guarantee of detecting an error with the probability of 1−δ does not strictly hold. In order to restore the 1−δ success rate after a reported attack, the synopsis 134 has to be recomputed from scratch, which is not typically possible in a streaming setting. Hence, it can be important to rigorously quantify the loss of guarantee after a series of warnings have been sent out without resetting the synopsis.
To achieve this, let $e_k=1$ if the k-th attack goes undetected and $e_k=0$ otherwise. Let $p_k$ be the probability that the server succeeds in its k-th attack after k−1 failed attempts (i.e., $p_k=\Pr[e_k=1 \mid e_1=0, \ldots, e_{k-1}=0]$). Therefore, it can be determined that $p_1 \le \delta$. To demonstrate the strength of one embodiment of the present invention, $p_k$ is bounded by an upper limit with respect to the most powerful server A. It can be assumed that the server A knows how PIRS works except its random seed, α; maximally explores the knowledge that could be gained from one failed attack; and possesses unbounded computational power.
The best that server A could do to improve $p_k$ over multiple attacks is quantified. The space of seeds used by PIRS is denoted R. For any answer w and vector v, the set of witnesses is denoted as $W(w, v)=\{r \in R \mid \text{PIRS raises an alarm on } r\}$ and the set of non-witnesses is denoted as $\overline{W}(w, v)=R-W(w, v)$. Note that |
where
Assuming that server A has made a total of k attacks to PIRS for any k, the probability that none of them succeeds is at least 1−kδ.
This probability is
The above shows that PIRS is very resistant towards coordinated multiple attacks, even against an adversary with unlimited computational power. For a typical value of $\delta=2^{-32}$, PIRS could tolerate millions of attacks before the probability of success becomes noticeably less than 1. The drop in the detection rate to 1−kδ occurs only if the client chooses to disclose the attacks to the server. Such disclosure is not required in many applications.
The PIRS can be extended to support sliding windows. PIRS-1 for count queries is used as an illustrative example. Those skilled in the art will recognize that the extension for sliding windows to sum queries, as well as to PIRS-2, PIRSγ, and PIRS±γ can also be similarly implemented.
One property of PIRS-1 is that it is decomposable (i.e., for any $v_1, v_2$, $X(v_1+v_2)=X(v_1)\cdot X(v_2)$; for PIRS-2, $X(v_1+v_2)=X(v_1)+X(v_2)$). This property allows PIRS to be extended for periodically sliding windows using standard techniques. One example of a sliding window query might be the following.
In this example, PIRS-1 can be built for every 5-minute period, and can be kept in memory until it expires from the sliding window. Assume that there are k such periods in the window, and let $X(v_1), \ldots, X(v_k)$ be the PIRS for these periods. In addition, the data owner maintains the overall PIRS $X(v)=X(v_1)\cdot X(v_2)\cdots X(v_k)$.
When a new PIRS $X(v_{k+1})$ completes, X(v) is updated as $X(v):=X(v)\cdot X(v_{k+1})\cdot(X(v_1))^{-1}$. For a periodically sliding window query with k periods, the synopsis uses $O(k(\log(m/\delta)+\log n))$
bits of space, spends O(1) time to process an update, and $O(\sum_{i: w_i \ne 0} \log w_i)$
time to perform a verification.
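The periodically sliding window can be sketched for PIRS-1 count queries as shown below. The class name `SlidingPirs1`, the modulus `P`, and the explicit `close_period` call (standing in for a timer that fires at the end of each period) are assumptions of the example. The modular inverse assumes that no per-period value is 0 modulo p, which holds unless the seed coincides with a group identifier.

```python
from collections import deque

P = (1 << 61) - 1  # assumed prime modulus


class SlidingPirs1:
    """PIRS-1 for count queries over the k most recent completed periods."""

    def __init__(self, alpha, periods):
        self.alpha = alpha
        self.periods = periods
        self.window = deque()   # per-period PIRS values still inside the window
        self.current = 1        # PIRS of the period being filled
        self.total = 1          # product of all per-period values in the window

    def update(self, group):
        self.current = self.current * ((self.alpha - group) % P) % P

    def close_period(self):
        # X(v) := X(v) * X(v_new) * X(v_oldest)^(-1)
        self.window.append(self.current)
        self.total = self.total * self.current % P
        if len(self.window) > self.periods:
            expired = self.window.popleft()
            self.total = self.total * pow(expired, -1, P) % P
        self.current = 1


alpha = 1234567
sliding = SlidingPirs1(alpha, periods=2)
for period in ([1, 2], [2], [3, 3]):      # groups seen in three successive periods
    for group in period:
        sliding.update(group)
    sliding.close_period()

# The window now covers the last two periods, i.e., the vector {2: 1, 3: 2}.
expected = (alpha - 2) * pow(alpha - 3, 2, P) % P
```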
For various window sizes consisting of between 1 and k periods, the k periods can be decomposed into a number of dyadic intervals. For simplicity assume that k is a power of 2. These intervals can be organized into l=log k levels. On level 0, there are k intervals each consisting of one period; on level i, 1≦i≦l−1, there are $k/2^i$ intervals, each spanning $2^i$ periods. There are a total of 2k−1 such dyadic intervals for this example. One PIRS is built for each interval, so the total size of the entire synopsis is still $O(k(\log(m/\delta)+\log n))$ bits.
Since a PIRS at level i+1 can be computed in constant time from two PIRS's at level i, the amortized update cost remains O(1). Upon a verification request with a window size of q periods, the window can be decomposed into at most O(log k) dyadic intervals, and those corresponding PIRS's can be combined together to form the correct synopsis for the query window. To support sliding window queries with various window sizes of up to k periods, the synopsis uses $O(k(\log(m/\delta)+\log n))$
bits of space, spends O(1) time to process an update, and O(log k) time to assemble the required synopsis upon a verification request. The client spends $O(\sum_{i: w_i \ne 0} \log w_i)$
time to perform a verification.
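The dyadic decomposition can be sketched over a static array of k per-period PIRS-1 values, with k a power of two. The function names and the representation of levels as nested lists are assumptions of this example; the sketch shows only how a query window is assembled from O(log k) precomputed products, not how the intervals are refreshed as the window slides.

```python
P = (1 << 61) - 1  # assumed prime modulus


def build_levels(period_values):
    """levels[i][j] is the product of periods j * 2^i .. (j + 1) * 2^i - 1."""
    levels = [list(period_values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[j] * prev[j + 1] % P for j in range(0, len(prev), 2)])
    return levels


def dyadic_cover(lo, hi):
    """Split the period range [lo, hi) into aligned dyadic (start, size) pieces."""
    pieces = []
    while lo < hi:
        # Largest power of two that divides lo (any power of two if lo == 0).
        size = lo & -lo if lo else 1 << hi.bit_length()
        while size > hi - lo:
            size >>= 1
        pieces.append((lo, size))
        lo += size
    return pieces


def window_synopsis(levels, lo, hi):
    """Combine the precomputed interval products covering periods [lo, hi)."""
    x = 1
    for start, size in dyadic_cover(lo, hi):
        level = size.bit_length() - 1
        x = x * levels[level][start >> level] % P
    return x


period_values = [2, 3, 5, 7, 11, 13, 17, 19]   # stand-ins for X(v_1), ..., X(v_8)
levels = build_levels(period_values)
last_five = window_synopsis(levels, 3, 8)       # window of the q = 5 latest periods
```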
In some embodiments, a synopsis that is tolerant for a few errors, solving the CQVγ problem can be implemented. To achieve this, let the threshold γ be the number of components in vector v that are allowed to be inconsistent. In some embodiments, a construction is presented that gives an exact solution that satisfies the requirements of CQVγ, and requires
bits of space. This synopsis can be strengthened so that errors can be located and even corrected. This exact solution uses space quadratic in γ. In other embodiments, an approximate solution which uses only
bits can be implemented. In other embodiments, a synopsis that can estimate the number of errors can be implemented that uses polylogarithmic space and does not depend on the threshold γ.
Using PIRS as a building block, the synopsis 134 can be constructed that satisfies the requirements of CQVγ. Referring to
Examining one layer of PIRSγ, let b be a pairwise independent hash function which maps the range {1, . . . , n} uniformly onto {1, . . . , k}. PIRSγ assigns vi to the b(i)-th bucket, and for each bucket computes the PIRS synopsis of the assigned subset of vi's with probability of failure δ′=1/(c2γ) where c2≧1 is a constant. Using PIRS-1 as an example, each of these k synopses occupies
bits. Given some w=γv, since there are fewer than γ errors, no alarm is raised. Constants c1 and c2 can be chosen such that if w≠γv, then an alarm is raised with probability at least ½ for this layer. In this case there are two cases when the query verifier fails to raise an alarm. First where there are fewer than γ buckets that contain erroneous components of w. Second where there are at least γ buckets containing erroneous components but at least one of them fails due to the failure probability of PIRS. Setting constants c1, c2=4.819, either of the above cases occurs with probability at most ¼. In the first case, the vi's are assigned to the buckets in a pairwise independent fashion, and it can be guaranteed that the mapping of the γ erroneous components onto the k buckets is injective with probability
where the last inequality holds by the choice of c1. In the second case the probability that some of the γ buckets that are supposed to raise an alarm fail is:
which holds as long as c2≧4.819.
Therefore, using one layer, PIRSγ raises an alarm with probability at least ½ on some w≠γv, and will not raise an alarm if w=γv. By using log(1/δ) layers and reporting an alarm if at least one of these layers raises an alarm, the probability is boosted to 1−δ. So, for any w≠γv, PIRSγ raises an alarm with probability at least 1−δ, and for any w=γv, PIRSγ does not raise an alarm.
In addition to the k log(1/δ) PIRS synopses, a hash function b mapping updates to buckets is generated. This is achieved by picking x and y uniformly at random from Zp, and computing b(i)=((xi+y) mod p) mod k, where “mod” denotes the modulo operation. This generates a function that is pairwise-independent over the random choices of x and y. Verification can be performed by computing the synopses for all of the layers in parallel while making one pass over the answer w. Initialization, update, and verification for PIRSγ appear in pseudocode below.
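The initialization, update, and verification steps can be illustrated with a minimal Python sketch. It is not the original pseudocode. It assumes a PIRS-1 style fingerprint of the form X=Π(α−i)^vi mod p for each bucket, a fixed Mersenne prime as the modulus p, base-2 logarithms for the number of layers, and non-negative integer updates; the class and method names are hypothetical.

```python
import math
import random

P = (1 << 61) - 1  # assumed prime modulus (a Mersenne prime); not specified in the text


class PirsGamma:
    """Illustrative PIRS-gamma: layers of buckets, one fingerprint per bucket."""

    def __init__(self, gamma, delta, c1=4.819, rng=None):
        rng = rng or random.Random()
        self.gamma = gamma
        self.k = math.ceil(c1 * gamma * gamma)  # buckets per layer
        self.num_layers = max(1, math.ceil(math.log2(1 / delta)))
        # Per layer: (x, y) for the pairwise-independent bucket hash b(i).
        self.hash_params = [(rng.randrange(P), rng.randrange(P))
                            for _ in range(self.num_layers)]
        # Per bucket: a secret evaluation point alpha for the fingerprint.
        self.alphas = [[rng.randrange(P) for _ in range(self.k)]
                       for _ in range(self.num_layers)]
        self.synopsis = self._empty()

    def _empty(self):
        return [[1] * self.k for _ in range(self.num_layers)]

    def bucket(self, layer, i):
        x, y = self.hash_params[layer]
        return ((x * i + y) % P) % self.k

    def _apply(self, table, i, u):
        # Multiply the fingerprint of bucket b(i) in every layer by (alpha - i)^u.
        for layer in range(self.num_layers):
            b = self.bucket(layer, i)
            factor = pow(self.alphas[layer][b] - i, u, P)
            table[layer][b] = table[layer][b] * factor % P

    def update(self, i, u):
        """Process a stream tuple (i, u): add u >= 0 to group i."""
        self._apply(self.synopsis, i, u)

    def verify(self, w):
        """Return True (alarm) if some layer has at least gamma mismatching buckets."""
        claimed = self._empty()
        for i, w_i in w.items():  # one pass over the answer w
            self._apply(claimed, i, w_i)
        for mine, theirs in zip(self.synopsis, claimed):
            mismatches = sum(a != b for a, b in zip(mine, theirs))
            if mismatches >= self.gamma:
                return True
        return False
```

In this sketch a layer votes for an alarm only when at least γ of its buckets disagree, so an answer with fewer than γ wrong groups can never trigger an alarm.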
PIRSγ requires
bits, spends
time to process a tuple in the stream, and
time to perform a verification. Careful analysis can facilitate a smaller constant in the asymptotic cost above. For a given γ, the minimum k is chosen such that equation (11) is at most ½, and 1/δ′ is chosen to be very large (close to the maximum allowed integer) so that equation (12) is almost zero. For instance
words suffice, respectively. For arbitrary γ, the storage requirement is
words in the worst case.
When there are a small number of errors (at most γ), PIRSγ does not raise an alarm, which gives some leeway to the server 110. This is often necessary so that the server can cope with large volumes of incoming data using some semantic load shedding strategies. However, in some critical applications, if the client 120 demands complete correctness, PIRSγ may not be sufficient, since it may only indicate to the client 120 whether there are fewer than γ errors, but not where they are. In some embodiments, a strengthened version of PIRSγ, referred to herein as PIRSγ*, can be implemented that is able to identify which groups are affected by errors, and even compute the correct sums for the affected groups, by taking advantage of a technique based on the binary decomposition of the group identifier.
Applying the binary decomposition to PIRSγ, the amount of information kept about each bucket is increased. In addition to keeping a PIRS synopsis of all items which fall into a given bucket, 2┌log n┐ PIRS synopses are maintained and arranged as a two-dimensional array A of size ┌log n┐×2. When an update to group i is placed into bucket b(i), the PIRS in A[j, bit(i, j)] is updated, for all 1≦j≦┌log n┐, where bit(i, j) denotes the jth bit in the binary representation of i.
To perform a query verification, the array A of PIRS synopses is computed for both the vector v and the answer w for each bucket. If all corresponding entries match, then (with high probability) there are no erroneous components in the bucket. If, for any j, the PIRS in both A[j, 0] and A[j, 1] do not match, then this indicates that there is more than one erroneous component in this bucket, because a single erroneous i cannot contaminate both A[j, 0] and A[j, 1]. Otherwise, there must be exactly one erroneous component falling into this bucket. This is the case for all erroneous components with high probability, provided that there are at most γ such components. In this case, for each j, exactly one of A[j, 0] and A[j, 1] does not match. If it is A[j, 1], this indicates that the jth bit of the identifier i of the erroneous group is 1; otherwise, it is 0. Using ┌log n┐ pairs of PIRS, the identifier can therefore be recovered exactly.
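The bit-by-bit recovery of an erroneous group identifier can be sketched for a single bucket as follows. This is a minimal illustration that assumes a PIRS-1 style fingerprint Π(α−i)^vi mod p with a fixed prime modulus, and it numbers bits from 0 at the least significant position rather than from 1; the function names are hypothetical.

```python
P = (1 << 61) - 1  # assumed prime modulus for the illustrative fingerprint


def bit_array(vec, nbits, alpha):
    """Build the nbits x 2 array A of fingerprints for the groups in one bucket."""
    A = [[1, 1] for _ in range(nbits)]
    for i, val in vec.items():
        factor = pow(alpha - i, val, P)
        for j in range(nbits):
            bit = (i >> j) & 1  # j-th bit of i, least significant bit first
            A[j][bit] = A[j][bit] * factor % P
    return A


def diagnose(A_v, A_w):
    """Return ('clean', None), ('single', group_id) or ('multiple', None)."""
    group_id, seen_mismatch = 0, False
    for j, (row_v, row_w) in enumerate(zip(A_v, A_w)):
        bad0 = row_v[0] != row_w[0]
        bad1 = row_v[1] != row_w[1]
        if bad0 and bad1:
            # One wrong group cannot touch both columns of the same row.
            return ("multiple", None)
        if bad1:
            group_id |= 1 << j
        seen_mismatch = seen_mismatch or bad0 or bad1
    return ("single", group_id) if seen_mismatch else ("clean", None)
```

For example, with groups {5, 9, 12} in a bucket and only group 9 reported incorrectly, the mismatching column in each row spells out the binary representation 1001 of the identifier 9.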
The erroneous components wi in the answer w returned by the server 110 can thus be located. Moreover, the synopsis retains enough information to recover the true vi for each wrong result. For example, suppose the bucket at layer l contains exactly one error, which is vi. Note that the data owner will return Xb
In PIRSγ*, each PIRS in PIRSγ is replaced with an array of O(log n) PIRS, so the space and time increase by an O(log n) factor. PIRSγ* requires
bits, spends
time to process a tuple in the stream, and
time to perform a verification. For any w≠γv, PIRSγ* raises an alarm with probability 1−δ; for any w=γv, PIRSγ* does not raise an alarm but correctly identifies and recovers the errors in the answer w with probability 1−δ.
When the number of errors, for example λ, is no more than γ, PIRSγ* can recover all the errors with high probability. When λ>γ, there may be too many errors to expect a complete recovery of all the query results. Nevertheless, PIRSγ* can recover a good proportion of the results. For this analysis, precision and recall are used to measure the performance of the synopsis. Precision refers to the probability that an identified error is truly an actual error. Since PIRS does not have false positives, precision is always 1. Recall is the fraction of the actual errors that have been recovered, or equivalently, the probability that any one error has been captured by the synopsis. For any given error ε, if the error ε falls into a bucket by itself in any of the layers, then PIRSγ* can correctly recover it. For a particular layer, because the errors are distributed into the buckets pairwise-independently and there are c1γ² buckets, the probability that the bucket containing ε is the same as the bucket for any of the other λ−1 errors is at most λ/(c1γ²) following the union bound. Since the log(1/δ) layers are mutually independent, the probability that this collision happens in all layers is
When there are λ>γ errors, PIRSγ* raises an alarm with probability 1−δ and recovers the errors with a recall of 1−δ^{Ω(log(γ²/λ))}.
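The recall argument can be made concrete with a short calculation. The sketch below assumes k=c1γ² buckets per layer, ┌log2(1/δ)┐ independent layers, and the union bound λ/(c1γ²) on the per-layer collision probability; the function name and the choice of base-2 logarithm are illustrative assumptions.

```python
import math


def recall_lower_bound(num_errors, gamma, delta, c1=4.819):
    """Lower bound on recall: 1 minus the chance an error collides in every layer."""
    layers = max(1, math.ceil(math.log2(1 / delta)))
    collide_one_layer = min(1.0, num_errors / (c1 * gamma ** 2))
    return 1.0 - collide_one_layer ** layers
```

With γ=10 and δ=0.01, for instance, the bound stays above 0.99 even for 200 errors, and it degrades to 0 only once the number of errors reaches the number of buckets.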
The exact solution is advantageous when only a small number of errors can be tolerated. In applications where γ is large, the quadratic space requirement can be prohibitive. If alarms can be raised when approximately γ errors have been observed, a space-efficient synopsis can be implemented. This approximation is often acceptable since, when γ is large, users may not be concerned if the number of errors detected deviates from γ by a small amount. An approximate solution, denoted PIRS±γ, guarantees that PIRS±γ raises no alarm with probability at least 1−δ on any w=γ−v, where γ−=(1−c/ln γ)γ, and raises an alarm with probability at least 1−δ on any w≠γ+v, where γ+=(1+c/ln γ)γ, for any constant c>−ln ln 2≈0.367. The multiplicative approximation ratio 1±c/ln γ is close to 1 for large γ.
PIRS±γ also contains multiple layers of buckets, where each bucket is assigned a subset of the components of vector v and summarized using PIRS (
independent layers and reporting the majority of the results, the probabilistic guarantee will be boosted to 1−δ using Chernoff bounds described in “Randomized Algorithms”, by Motwani et al., the subject matter of which is incorporated by reference in its entirety.
As an example, let k be the number of buckets per layer. The components of vector v are distributed into the k buckets in a γ+-wise independent fashion, and for each bucket the PIRS summary of those components is computed using δ′=1/γ². Given some answer w, let this layer raise an alarm only if all the k buckets report alarms. If the answer w contains more than γ+ erroneous members, then the probability that every bucket gets at least one such component is high; and if the answer w contains fewer than γ− erroneous members, then the probability that there exists some bucket that is not assigned any erroneous members is also high.
One factor that determines whether a layer could possibly raise an alarm is the distribution of erroneous components into buckets. The event that all buckets raise alarms is only possible if each bucket contains at least one inconsistent component. Consider all the inconsistent components in the answer w in some order, for example w1,w2, . . . , where each of them can be considered a collector that randomly picks a bucket to “collect”. Assume that there are enough inconsistent elements, and let the random variable Y denote the number of inconsistent components required to collect all the buckets (i.e., Y is the smallest i such that w1, . . . ,wi have collected all the buckets). The problem becomes an instantiation of the coupon collector's problem (viewing buckets as coupons and erroneous components as trials). With k buckets, it is known that E(Y)=k ln k+O(k), therefore k is set such that γ=┌k ln k┐. It can be seen that k=O(γ/ln γ), hence the desired storage requirement.
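The choice of k can be illustrated numerically. Because not every γ equals ┌k ln k┐ for some integer k, the sketch below assumes that the largest k with ┌k ln k┐≦γ is taken; that rounding rule and the function name are assumptions made only for illustration.

```python
import math


def buckets_per_layer(gamma):
    """Largest k >= 2 with ceil(k ln k) <= gamma (assumes gamma >= 2)."""
    k = 2
    while math.ceil((k + 1) * math.log(k + 1)) <= gamma:
        k += 1
    return k
```

For γ=1000 this gives k=190, far fewer than the c1γ² buckets per layer used by the exact PIRSγ construction.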
For any constant c′,
Pr[Y≦k(ln k−c′)]≦e^{−e^{c′}}+o(1),
Pr[Y≧k(ln k+c′)]≦1−e^{−e^{−c′}}+o(1),
where o(1) depends on k.
Notice that ln γ≦2 ln k for any k≧2, so the above equations also imply that for any real constant c:
If w=γ
e
−e
≦1/e. (17)
If w≠γ
For γ large enough, there exists a constant ε>0 such that this probability is at most ½−ε for any c>−ln ln 2.
To summarize, if c>−ln ln 2≈0.367, then both the false positive and false negative probabilities are at most ½−ε for some constant ε at one layer with k=O(γ/log γ) buckets.
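The role of the threshold −ln ln 2 can be checked numerically. The sketch below assumes that the per-layer error probabilities take the standard limiting coupon-collector forms e^{−e^{c}} and 1−e^{−e^{−c}}, with the o(1) terms and the PIRS failure probability ignored; the pairing of each form with a false positive or a false negative is this sketch's reading of the argument.

```python
import math

C_MIN = -math.log(math.log(2))  # threshold on c, approximately 0.3665


def limiting_error_bounds(c):
    """Return (false_positive, false_negative) limits for one layer."""
    # Too few errors, yet every bucket happens to receive one.
    false_positive = math.exp(-math.exp(c))
    # Enough errors, yet at least one bucket receives none.
    false_negative = 1.0 - math.exp(-math.exp(-c))
    return false_positive, false_negative
```

At c=C_MIN the second quantity equals exactly ½, which is why c must be strictly larger for a majority vote across layers to help.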
To drive down the error probabilities for both false positives and false negatives to δ, l=O(log(1/δ)) layers are used and the simple majority of their “votes” is reported. This probability is quantified for false negatives; the other case is symmetric.
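The effect of majority voting can be examined with an exact binomial calculation rather than a Chernoff bound. The sketch below assumes that each layer is independently correct with probability ½+ε and that only odd numbers of layers are tried so that ties cannot occur; the function names are hypothetical.

```python
from math import comb


def majority_failure_prob(layers, p_correct):
    """Probability that at most half of the layers vote correctly."""
    return sum(comb(layers, z) * p_correct ** z * (1 - p_correct) ** (layers - z)
               for z in range(layers // 2 + 1))


def layers_needed(delta, eps):
    """Smallest odd number of layers whose majority vote errs with probability <= delta."""
    p_correct = 0.5 + eps
    layers = 1
    while majority_failure_prob(layers, p_correct) > delta:
        layers += 2
    return layers
```

The returned count grows in proportion to log(1/δ) for a fixed ε, consistent with the l=O(log(1/δ)) layers stated above.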
Each layer can be viewed as a coin flip that raises a true alarm with probability at least ½+ε. Let the random variable Z denote the number of layers that raise alarms. This process is a sequence of independent Bernoulli trials, hence Z follows the binomial distribution. For l independent layers, the expectation of Z is at least μ=(½+ε)l. By the Chernoff bound, the probability that a majority of layers raise alarms is
Therefore, it is ensured that
which can be satisfied by taking
A γ+-wise independent random hash function is generated to map groups to buckets. Using standard techniques, such a function can be generated using O(γ log n) truly random bits. Specifically, the technique for constructing t-universal hash families can be used. Let p be some prime between n and 2n, and let α0, . . . , αγ−1 be γ random numbers chosen uniformly and independently from Zp={0, . . . , p−1}. Then we set
This function is guaranteed to be drawn from a t-wise independent family of functions (so that, over the random choice of the function, the probability of t items colliding under the hash function is 1/k^{t−1}). For an incoming tuple s=(i, u), b(i) is computed using the αj's in O(γ) time (using Horner's rule), and the update is then applied to the corresponding PIRS. This requires the storage of O(γ+)=O(γ) truly random numbers per layer. As a result, PIRS±γ uses
bits of space, spends
time to process an update and
time to perform a verification. By allowing two-sided errors, the size of the synopsis can be reduced from quadratic in γ to linear.
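The polynomial hash used to map groups to buckets can be sketched as follows. The sketch assumes the usual construction b(i)=((α0+α1i+ . . . +αt−1i^(t−1)) mod p) mod k for a t-wise independent family, evaluated with Horner's rule; the function name is hypothetical, and the caller is assumed to supply a prime p between n and 2n.

```python
import random


def make_twise_hash(t, p, k, rng=None):
    """Return (b, coeffs), where b maps a group id to one of k buckets."""
    rng = rng or random.Random()
    coeffs = [rng.randrange(p) for _ in range(t)]  # alpha_0 .. alpha_{t-1}

    def b(i):
        acc = 0
        # Horner's rule: start from the highest-degree coefficient.
        for a in reversed(coeffs):
            acc = (acc * i + a) % p
        return acc % k

    return b, coeffs
```

Each evaluation performs t multiplications and additions modulo p, which corresponds to the O(γ) per-tuple cost mentioned above when t is on the order of γ.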
An improved solution for the CQVγ problem, FM-PIRS, can be generated, whose size and update cost depend only on the degree of approximation, and not on γ, thus allowing it to scale well with γ. FM-PIRS directly estimates the number of errors in the result provided by the server, and then compares the estimate with γ. As a result, FM-PIRS can also support a wider range of values of γ, which can be given only at verification time. For small values of γ, the bounds and guarantees of PIRSγ and PIRS±γ are preferred, but for larger values of γ, the cost of FM-PIRS is preferable.
As the name suggests, FM-PIRS is a combination of PIRS and an FM sketch, which is described in “Probabilistic Counting Algorithms For database Applications”, by Flajolet et al., the subject matter of which is incorporated herein by reference in its entirety. The FM sketch is used to estimate the number of distinct elements in a stream. The FM sketch is described as follows. Suppose that the universe is [n]={1, . . . , n}. A random hash function h:[n]→[2^L−1] is picked such that any h(i) is uniformly distributed over [2^L−1], where L=O(log n). For each element i in the stream, h(i) is computed. The number of trailing zeros in the binary representation of h(i) is denoted by r(i). The FM sketch computes R=max{r(i), for all i in the stream} and then outputs k/φ·2(R
To illustrate the FM-PIRS synopsis, the basic FM sketch with k=1 is used, although those skilled in the art will recognize that generalization to larger k is possible. Each “wrong” group i, that is, each group for which vi≠wi, can be treated as a distinct element in the universe [n], and then R=max{r(i), for all wrong groups i} can be computed. Generally, the data owner does not know whether i is a wrong group, so R cannot be computed directly. Instead, a number L of PIRS's X1, . . . ,XL with δ′=δ/L are created. For any i, group i is put into Xj if j≦r(i). Thus X1 gets half of the groups, X2 gets a quarter of the groups, etc.
The value of R can be computed as follows. When all of X1, . . . ,XL correctly capture the errors in them, which happens with probability at least 1−δ′L=1−δ, R=arg maxj {Xj raises an alarm}.
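The way R is read off the L synopses can be sketched for the k=1 case. The sketch is a simplified illustration: it assumes a PIRS-1 style fingerprint Π(α−i)^vi mod p, uses an affine hash modulo a fixed prime as a stand-in for the uniform hash h, and stops at R without converting it into an error-count estimate; all names are hypothetical.

```python
import random

P = (1 << 61) - 1  # assumed prime modulus for the hash and the fingerprints


class FmPirs:
    """Illustrative FM-PIRS with a single partition (k = 1)."""

    def __init__(self, L, rng=None):
        rng = rng or random.Random()
        self.L = L
        self.a = rng.randrange(1, P)  # stand-in for the uniform hash h
        self.b = rng.randrange(P)
        self.alphas = [rng.randrange(P) for _ in range(L)]  # one per X_j

    def r(self, i):
        """Number of trailing zeros of h(i), capped at L."""
        h = ((self.a * i + self.b) % P) % (1 << self.L)
        if h == 0:
            return self.L
        return (h & -h).bit_length() - 1

    def synopsis(self, vec):
        """Fingerprints X_1..X_L; group i contributes to X_j for every j <= r(i)."""
        X = [1] * self.L
        for i, val in vec.items():
            for j in range(self.r(i)):
                X[j] = X[j] * pow(self.alphas[j] - i, val, P) % P
        return X

    def estimate_R(self, X_v, X_w):
        """Largest j (1-based) whose synopses disagree, or 0 if none do."""
        R = 0
        for j in range(self.L):
            if X_v[j] != X_w[j]:
                R = j + 1
        return R
```

In this sketch the value returned by estimate_R coincides, with high probability, with the maximum of r(i) over the wrong groups, even though the verifier never learns which groups are wrong.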
For any fixed number k of partitions, FM-PIRS has a size of O(k log n(log m+log n)) bits, processes a tuple in expected time O(1), and computes an estimate of the number of errors in the result in expected time O(|w| log(m/|w|)). With probability at least 1−δ, the estimate has a bias bounded by 1+0.31/k and a standard error of 0.78/√k.
Since each partition keeps L=O(log n) PIRS's, the overall size of FM-PIRS is O(k log n(log m+log n)) bits. For an incoming tuple, only one partition gets affected, but 0 to L of the PIRS's in this partition can get updated. Since the hash function h is uniform, the expected number of PIRS's updated is O(1). Upon receiving the FM-PIRS synopses of vector v and an answer w from the server, we need to spend O(log wi) expected time per non-zero entry of the answer w to compute the FM-PIRS synopses of the answer w. So the expected time needed for an estimation is
An analytical comparison of PIRS±γ and FM-PIRS can be provided. Since FM-PIRS computes an estimate of the number of errors in the answer w, FM-PIRS can be used to do the same task for which PIRS±γ is designed. For a fair comparison, k is set such that FM-PIRS provides the same probabilistic guarantee that PIRS±γ does. The standard error of FM-PIRS is O(1/√k), and PIRS±γ allows a deviation of O(1/ln γ).
By setting k=O(log² γ), it can be guaranteed that FM-PIRS captures both false positives and false negatives with good probabilities (e.g., greater than ¾). By using O(log(1/δ)) independent copies of FM-PIRS and taking the median, the success probability can be boosted to 1−δ, the same as what PIRS±γ guarantees. Finally, only L=O(log γ) PIRS's are needed, since estimating the number of errors when there are more than, for example, 2γ of them is not required.
Under this configuration, FM-PIRS uses O(log³ γ(log m+log n) log(1/δ)) bits of space. Thus, asymptotically (as γ grows), FM-PIRS is better than PIRS±γ. However, for small γ, PIRS±γ should be better in terms of size, while FM-PIRS becomes better when γ exceeds some large threshold.
Applications 310 can be resident in the storage 308. The applications 310 can include instructions for implementing embodiments of the present invention. For embodiments where the computing device 300 is implemented as the data owner 130, the applications 310 can include instructions for implementing the query verifier 132. For embodiments where the computing device 300 is implemented as the untrusted server 110, the applications 310 can include instructions for implementing the DSMS 112. For embodiments where the computing device 300 is implemented as the client 120, the applications 310 can include instructions for implementing the queries, as well as for implementing the verification of query results using the synopsis 134 generated by the data owner. The storage 308 can be local or remote to the computing device 300. The computing device 300 includes a network interface 312 for communicating with a network. The CPU 302 operates to run the applications in storage 308 by performing instructions therein and storing data resulting from the performed instructions, which may be presented to an operator via the display 304 or by other mechanisms known to those skilled in the art, such as a printout from a printer.
Although preferred embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments and that various other changes and modifications may be effected herein by one skilled in the art without departing from the scope or spirit of the invention, and that it is intended to claim all such changes and modifications that fall within the scope of the invention.