The present invention generally relates to data streams, and more particularly relates to measuring distance between data in a data stream.
Recent years have witnessed an explosive growth in the amount of available data. Data stream algorithms have become a quintessential tool for analyzing such data. These algorithms have found diverse applications, such as large scale data processing and data warehousing, machine learning, network monitoring, and sensor networks and compressed sensing. A key ingredient in all these applications is a distance measure between data. In nearest neighbor applications, a database of points is compared to a query point to find the nearest match. In clustering, classification, and kernels, e.g., those used for support vector machines (SVM), given a matrix of points, all pairwise distances between the points are computed. In network traffic analysis and denial of service detection, global flow statistics computed using Net-Flow software are compared at different times via a distance metric. Seemingly unrelated applications, such as the ability to sample an item in a tabular database proportional to its weight, i.e., to sample from the forward distribution, or to sample from the output of a SQL Join, require a distance estimation primitive for proper functionality.
One of the most robust measures of distance is the ℓ1-distance (rectilinear distance), also known as the Manhattan or taxicab distance. The main reason this distance is robust is that it is less sensitive to outliers. Given vectors x, y∈ℝ^n, the ℓ1-distance is defined as $\|x - y\|_1 = \sum_{i=1}^{n} |x_i - y_i|$.
This measure, which also equals twice the total variation distance, is often used in statistical applications for comparing empirical distributions, for which it is more meaningful and natural than Euclidean distance. The ℓ1-distance also has a natural interpretation for comparing multisets, whereas Euclidean distance does not. Other applications of ℓ1 include clustering, regression (with applications to time sequences), Internet-traffic monitoring, and similarity search. In the context of certain nearest-neighbor search problems, “the Manhattan distance metric is consistently more preferable than the Euclidean distance metric for high dimensional data mining applications”. The ℓ1-distance may also support faster indexing for similarity search.
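As a purely illustrative sketch (the function names below are hypothetical and are not part of the disclosed embodiments), the ℓ1-distance and its relation to the total variation distance between two empirical distributions can be computed directly when both vectors fit in memory:

```python
from typing import Sequence


def l1_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Rectilinear (Manhattan) distance: the sum of |x_i - y_i|."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(x, y))


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions on one support.

    It equals half of the l1-distance between the probability vectors.
    """
    return 0.5 * l1_distance(p, q)


# Two empirical distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
```

This direct computation needs all n coordinates at once, which is exactly what a streaming sketch avoids.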
Another application is with respect to estimating cascaded norms of a tabular database, i.e., the ℓp norm on a list of attributes of a record is first computed, then these values are summed up over records. This problem is known as ℓ1(ℓp) estimation. An example application is in the processing of financial data. In a stock market, changes in stock prices are recorded continuously using a rlog quantity known as logarithmic return on investment. To compute the average historical volatility of the stock market from the data, the data is segmented by stock, the variance of the rlog values is computed for each stock, and then these variances are averaged over all stocks. This corresponds to an ℓ1(ℓ2) computation (normalized by a constant). As a subroutine for computing ℓ1(ℓ2), the best known algorithms use a routine for ℓ1-estimation.
In one embodiment, a method for determining a distance between at least two vectors of n coordinates is disclosed. The method comprises identifying a set of heavy coordinates from a set of n coordinates associated with at least two vectors. A heavy coordinate is represented as |xi|≧∈²∥x∥1, where x is a vector, i is a coordinate in the set of n coordinates, and ∈ is an arbitrary number. A set of light coordinates is identified from the set of n coordinates associated with the at least two vectors, wherein a light coordinate is represented as |xi|<∈²∥x∥1. A first estimation of a contribution is determined from the set of heavy coordinates to a rectilinear distance between the at least two vectors. A second estimation of a contribution is determined from the set of light coordinates to the rectilinear distance. The first estimation is combined with the second estimation.
In another embodiment, an information processing system for determining a distance between at least two vectors of n coordinates is disclosed. The information processing system comprises a memory and a processor that is communicatively coupled to the memory. A data stream analyzer is communicatively coupled to the processor and the memory. The data stream analyzer is configured to perform a method. The method comprises identifying a set of heavy coordinates from a set of n coordinates associated with at least two vectors. A heavy coordinate is represented as |xi|≧∈²∥x∥1, where x is a vector, i is a coordinate in the set of n coordinates, and ∈ is an arbitrary number. A set of light coordinates is identified from the set of n coordinates associated with the at least two vectors, wherein a light coordinate is represented as |xi|<∈²∥x∥1. A first estimation of a contribution is determined from the set of heavy coordinates to a rectilinear distance between the at least two vectors. A second estimation of a contribution is determined from the set of light coordinates to the rectilinear distance. The first estimation is combined with the second estimation.
In yet another embodiment, a computer program product for determining a distance between at least two vectors of n coordinates is disclosed. The computer program product comprises a storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method. The method comprises identifying a set of heavy coordinates from a set of n coordinates associated with at least two vectors. A heavy coordinate is represented as |xi|≧∈²∥x∥1, where x is a vector, i is a coordinate in the set of n coordinates, and ∈ is an arbitrary number. A set of light coordinates is identified from the set of n coordinates associated with the at least two vectors, wherein a light coordinate is represented as |xi|<∈²∥x∥1. A first estimation of a contribution is determined from the set of heavy coordinates to a rectilinear distance between the at least two vectors. A second estimation of a contribution is determined from the set of light coordinates to the rectilinear distance. The first estimation is combined with the second estimation.
The accompanying figures where like reference numerals refer to identical or functionally similar elements throughout the separate views, and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present invention, in which:
according to one embodiment of the present invention;
Operating Environment
As shown in
Computer system/server 102 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 102, and it includes both volatile and non-volatile media, removable and non-removable media. System memory 106, in one embodiment, comprises a data stream analyzer 110 that performs one or more of the embodiments discussed below with respect to measuring distance between data. It should be noted that the data stream analyzer 110 can also be implemented in hardware. The system memory 106 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 112 and/or cache memory 114.
Computer system/server 102 can further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 116 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 108 by one or more data media interfaces. As will be further depicted and described below, memory 106 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
Program/utility 118, having a set (at least one) of program modules 120, may be stored in memory 106 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 120 generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
Computer system/server 102 may also communicate with one or more external devices 122 such as a keyboard, a pointing device, a display 124, etc.; one or more devices that enable a user to interact with computer system/server 102; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 102 to communicate with one or more other computing devices. Such communication can occur via I/O interfaces 126. Still yet, computer system/server 102 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 128. As depicted, network adapter 128 communicates with the other components of computer system/server 102 via bus 108. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer system/server 102. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems.
Overview
The inventors' paper entitled “Fast Manhattan Sketches in Data Streams”, by Jelani Nelson and David P. Woodruff, ACM PODS '10, Indianapolis, IN, USA, is hereby incorporated by reference in its entirety. As discussed above, the ℓ1-distance, also known as the Manhattan or taxicab distance, between two vectors x, y in ℝ^n is Σi=1n|xi−yi|. Approximating this distance is a fundamental primitive on massive databases, with applications to clustering, nearest neighbor search, network monitoring, regression, sampling, and support vector machines. One or more embodiments of the present invention are directed to the problem of estimating the ℓ1-distance in the most general turnstile model of data streaming.
Formally, given a total of m updates (positive or negative) to an n-dimensional vector x, one or more embodiments maintain a succinct summary, or sketch, of what has been seen so that at any point in time the data stream analyzer can output an estimate E(x) so that with high probability, (1−∈)∥x∥1≦E(x)≦(1+∈)∥x∥1, where ∈>0 is a tunable approximation parameter. Here, an update has the form (i, v), meaning that the value v should be added to coordinate i. One or more embodiments assume that v is an integer (this is without loss of generality by scaling), and that |v|≦M, where M is a parameter. Updates can be interleaved and presented in an arbitrary order. Of interest are the amount of memory needed to store the sketch, the amount of time to process a coordinate update, and the amount of time to output an estimate upon request.
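The update model and the accuracy guarantee above can be illustrated with a small reference implementation. This is only a sketch for checking estimates: it stores x explicitly (which a streaming sketch is designed to avoid), uses 0-based indices, and the names `apply_updates`, `l1_norm`, and `is_good_estimate` are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def apply_updates(n: int, updates: Iterable[Tuple[int, int]]) -> List[int]:
    """Apply turnstile updates (i, v), i.e. x_i <- x_i + v, to a zero vector."""
    x = [0] * n
    for i, v in updates:
        x[i] += v  # v may be negative; updates may arrive in any order
    return x


def l1_norm(x: Sequence[int]) -> int:
    """Exact l1 norm of x."""
    return sum(abs(v) for v in x)


def is_good_estimate(estimate: float, x: Sequence[int], eps: float) -> bool:
    """Check (1 - eps) * ||x||_1 <= estimate <= (1 + eps) * ||x||_1."""
    norm = l1_norm(x)
    return (1 - eps) * norm <= estimate <= (1 + eps) * norm
```

For example, the stream (0, 5), (2, −3), (0, −2), (1, 4) over n=3 coordinates leaves x=(3, 4, −3), whose ℓ1 norm is 10.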
One or more embodiments of the present invention are advantageous because they give the first 1-pass streaming algorithm for this problem in the turnstile model with O*(∈⁻²) space and O*(1) update time, where the bounds are optimal up to O*(1) factors. The O* notation hides polylogarithmic factors in ∈, n, and the precision required to store vector entries. In particular, one or more embodiments provide a 1-pass algorithm using ∈⁻² polylog(nmM) space for ℓ1-estimation in data streams with polylog(nmM) update time and reporting time ∈⁻² polylog(nmM). This algorithm is simultaneously optimal in both the space and the update time up to polylog(nmM) factors. Conventional algorithms either required at least ∈⁻³ polylog(nmM) bits of space, or at least ∈⁻² update time. As ∈ can be arbitrarily small, the result of one or more embodiments can provide a substantial benefit over conventional algorithms. In light of known lower bounds, the space and time complexity of these one or more embodiments are optimal up to polylog(nmM) factors.
It should be noted that in the following discussion, for a function ƒ the notation O*(ƒ) is used to denote a function g=O(ƒ·polylog(nmM/∈)). Θ* and Ω* are similarly defined.
The improvements provided by one or more embodiments of the present invention result in corresponding gains for the aforementioned applications. Examples include the scan for nearest neighbor search, for which, to obtain sketches of size O*(∈⁻²), these embodiments reduce the preprocessing time from O(nd∈⁻²) to O*(nd). These embodiments also shave an ∈⁻² factor in the time for computing all pairwise ℓ1-distances, in the update time for sampling from the forward distribution, in the time for comparing two collections of traffic-flow summaries, and in the time for estimating cascaded norms.
Techniques
Using the Cauchy sketches of Li (particularly, the geometric mean estimator) would require Ω*(∈−2) update time. Multi-level sketches can be used, incurring an extra Ω*(∈−1) factor in the space. Various embodiments of the present invention achieve O*(1) update time by using Cauchy sketches (and particularly, Li's geometric mean estimator). However, to achieve this result one or more embodiments preprocess and partition the data, as discussed in greater detail below.
A Cauchy sketch is now described. Given a vector x, the sketch is a collection of counters $Y_j = \sum_{i=1}^{n} x_i C_{i,j}$ for j=1, . . . , k, where the Ci,j are standard Cauchy random variables with probability density function $\mu(t) = \frac{1}{\pi (1 + t^2)}$.
The Ci,j are generated pseudo-randomly using a pseudo-random generator (PRG). By the 1-stability of the Cauchy distribution, Yj is also distributed as a standard Cauchy random variable, scaled by ∥x∥1. Li shows that there is a constant ck>0 so that for any k≧3, if Y1, . . . , Yk are independent Cauchy sketches, then the geometric mean estimator
EstGM=ck·(|Y1|·|Y2| . . . |Yk|)^(1/k)
has an expected value E[EstGM]=∥x∥1 and a variance of Var[EstGM]=Θ(∥x∥1²/k). The space and time complexity of maintaining the Yj in a data stream are O*(k), and by linearity, the Yj can be computed in a single pass. By Chebyshev's inequality, for k=Θ(∈⁻²) one obtains a (1±∈)-approximation to ∥x∥1 with constant probability, which can be amplified by taking the median of independent repetitions. While the space needed is O*(∈⁻²), so is the update time.
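The Cauchy sketch and the geometric mean estimator described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: the Cauchy variables come from Python's `random` module rather than a space-bounded PRG, the whole vector is given at once rather than as a stream, and the normalizing constant is taken to be ck=cos^k(π/(2k)), which follows from the standard fact that E|C|^λ=1/cos(πλ/2) for a standard Cauchy variable C and |λ|<1. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def standard_cauchy(rng: random.Random) -> float:
    """Draw one standard Cauchy variate via the inverse-CDF method."""
    return math.tan(math.pi * (rng.random() - 0.5))


def cauchy_sketch(x: Sequence[float], k: int, rng: random.Random) -> List[float]:
    """Return the k counters Y_j = sum_i x_i * C_{i,j}."""
    y = [0.0] * k
    for xi in x:
        if xi == 0:
            continue  # zero coordinates contribute nothing
        for j in range(k):
            y[j] += xi * standard_cauchy(rng)
    return y


def geometric_mean_estimate(y: Sequence[float]) -> float:
    """Estimate ||x||_1 as c_k * (|Y_1| * ... * |Y_k|)^(1/k)."""
    k = len(y)
    c_k = math.cos(math.pi / (2 * k)) ** k  # assumed normalizing constant
    # Work in log space to avoid overflow in the product.
    log_gm = sum(math.log(abs(v)) for v in y) / k
    return c_k * math.exp(log_gm)
```

Because every one of the k counters is touched for each coordinate, the per-update cost grows linearly with k, which is the Θ(∈⁻²) update time that the bucketing scheme discussed next is intended to avoid.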
The starting point of one or more embodiments is the following idea. Suppose the coordinates are randomly partitioned into O(∈⁻²) buckets. In each bucket Li's estimator is maintained, but only with parameter k=3. Given an update to a coordinate i, it lands in a unique bucket, and the contents of this bucket can be updated in O*(1) time. Using Θ(∈⁻²) buckets, the space is also O*(∈⁻²). One is then faced with the following temptation: letting Gi be the estimate returned by Li's procedure in bucket i for k=3, output G=Σi=1r Gi, where r is the number of buckets. From the properties of the Gi, this is correct in expectation.
The main wrinkle is that Var[G] can be as large as Ω(∥x∥1²), which is not good enough.
To see that this can happen, suppose x contains only a single non-zero coordinate x1=1. In the bucket containing x1, the value G of Li's estimator is the geometric mean of 3 standard Cauchy random variables. By the above, Var[G]=Θ(∥x∥1²/k)=Θ(∥x∥1²).
Note though that in the above example, x1 contributed a large fraction of the ℓ1 mass of x (in fact, all of it). The main idea of one or more embodiments then is the following. A φ-heavy coordinate of the vector x is a coordinate i for which |xi|≧φ·∥x∥1. Algorithms for finding heavy coordinates, also known as iceberg queries, have been extensively studied in the database community, and such algorithms are used in the algorithm of one or more embodiments of the present invention. Set φ=∈². Every φ-heavy coordinate is removed from x, the contribution of these heavy coordinates is estimated separately, and then the bucketing above is used on the remaining coordinates. This reduces Var[G] to O(∥xtail∥2²), where xtail is the vector obtained from x by removing the heavy coordinates. A calculation shows that O(∥xtail∥2²)=O(∈²∥x∥1²), which is good enough to argue that ∥xtail∥1 can be estimated to within an additive ∈∥x∥1 with constant probability. This idea can be implemented in a single pass.
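The calculation referred to above follows because every remaining coordinate satisfies |xi|<∈²∥x∥1, so ∥xtail∥2²≦(max |xi|)·∥xtail∥1≦∈²∥x∥1². The following offline sketch (with a hypothetical helper name, and assuming the full vector is available, unlike in the streaming setting) splits a vector at the threshold φ=∈² and lets the bound be checked numerically:

```python
from typing import Dict, List, Sequence, Tuple


def split_heavy_light(
    x: Sequence[float], eps: float
) -> Tuple[Dict[int, float], List[float]]:
    """Split x into its eps^2-heavy coordinates and the remaining tail.

    Returns (heavy, tail): heavy maps index -> value for every i with
    |x_i| >= eps^2 * ||x||_1, and tail is x with those entries zeroed.
    """
    norm1 = sum(abs(v) for v in x)
    threshold = eps ** 2 * norm1
    heavy = {i: v for i, v in enumerate(x) if v != 0 and abs(v) >= threshold}
    tail = [0 if i in heavy else v for i, v in enumerate(x)]
    return heavy, tail
```

For a vector with two large entries (100 and −50) and 350 entries equal to 1, and ∈=0.3, the threshold is about 45, only the two large entries are heavy, and ∥xtail∥2²=350 is well below ∈²∥x∥1²=22500.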
The main remaining hurdle is estimating ∥xhead∥1, the contribution to ∥x∥1 from the heavy coordinates. Using current techniques, the CountMin sketch can be used to estimate the value of each ∈²-heavy coordinate up to an additive ∈³∥x∥1. Summing the estimates gives ∥xhead∥1 up to an additive ∈∥x∥1. This, however, requires Ω*(∈⁻³) space, which, in some embodiments, cannot be afforded. Instead, a new subroutine, Filter, is designed that estimates the sum of the absolute values of the heavy coordinates, i.e., the value ∥xhead∥1, up to an additive ∈∥x∥1, without guaranteeing an accurate frequency estimate for any individual heavy coordinate. This relaxed guarantee is sufficient for correctness of the overall algorithm, and is implementable in O*(∈⁻²) space.
Other technical complications arise due to the fact that the partitioning is not truly random, nor is the randomness used by Li's estimator. Therefore, one or more embodiments use a family that is close to an O(∈−2)-wise independent family, but doesn't suffer the O(∈−2) evaluation time required of functions in such families (e.g., O(∈−2)-degree polynomial evaluation). These functions can be evaluated in constant time. The caveat is that the correctness analysis needs more attention.
Preliminaries
The algorithm used by the data stream analyzer 110 operates, in one embodiment, in the following model. A vector x of length n is initialized to 0, and it is updated in a stream of m updates from the set [n]×{−M, . . . , M}. An update (i, v) corresponds to the change xi←xi+v. In one embodiment, a (1±∈)-approximation to ∥x∥1=Σi=1n|xi| is computed for some given parameter ∈>0. All space bounds in this discussion are in bits, and all logarithms are base 2, unless explicitly stated otherwise. Running times are measured as the number of standard machine word operations (integer arithmetic, bitwise operations, and bitshifts). A distinction is made between update time, which is the time to process a stream update, and reporting time, which is the time required to output an answer. Each machine word is assumed to be Ω(log(nmM/∈)) bits so that each vector can be indexed and arithmetic can be performed on vector entries and the input approximation parameter in constant time.
Throughout this discussion, for integer z, [z] is used to denote the set {1, . . . , z}. For reals A, B, A±B is used to denote some value in the interval [A−B, A+B]. Whenever a frequency xi is discussed, that frequency at the stream's end is being referred to. It is also assumed that ∥x∥1≠0 without loss of generality (note ∥x∥1=0 iff ∥x∥2=0, and the latter can be detected with arbitrarily large constant probability in O(log(nmM)) space and O(1) update and reporting time by, say, the AMS sketch), and that ∈<∈0 for some fixed constant ∈0.
ℓ1 Streaming Algorithm
The ℓ1 streaming algorithm used by the data stream analyzer 110 for (1±∈)-approximating ∥x∥1 is now discussed in greater detail. As discussed above, the algorithm works by estimating the contribution to ℓ1 from the heavy coordinates and non-heavy coordinates separately, then summing these estimates.
A “φ-heavy coordinate” is an index i such that |xi|≧φ∥x∥1. A known heavy coordinate algorithm for the turnstile model of streaming (the model currently being operated in) is used to identify the ∈²-heavy coordinates. Given this information, a subroutine, Filter (discussed below), is used to estimate the contribution of these heavy coordinates to ℓ1 up to an additive error of ∈∥x∥1. This takes care of the contribution from heavy coordinates. R=Θ(1/∈²) “buckets” Bi are maintained in parallel, which allow the contribution from non-heavy coordinates to be estimated. Each index in [n] is hashed to exactly one bucket i∈[R]. The ith bucket keeps track of the dot product of x, restricted to those indices hashed to i, with three random Cauchy vectors; a known unbiased estimator of ℓ1 due to Li (the “geometric mean estimator”) is then applied to estimate the ℓ1 norm of x restricted to indices hashed to i. The estimates from the buckets not containing any ∈²-heavy coordinates are then summed up (with some scaling). The value of the summed estimates turns out to be approximately correct in expectation. Then, using the fact that the summed estimates only come from buckets without heavy coordinates, it can be shown that the variance is also fairly small, which then shows that the estimation of the contribution from the non-heavy coordinates is correct up to ∈∥x∥1 with large probability.
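The bucketing step described above can be sketched as follows. This is an illustration under simplifying assumptions rather than the disclosed algorithm: the hash function and the Cauchy variables are derived from Python's seeded `random.Random` instead of the hash family of Theorem 1 and a space-bounded PRG, the set of heavy indices is assumed to be supplied by some heavy coordinate algorithm, and the class and method names are hypothetical. The constant cos³(π/6) is the k=3 normalization implied by E|C|^(1/3)=1/cos(π/6) for a standard Cauchy variable C.

```python
import math
import random
from typing import Iterable, List


class BucketedTailSketch:
    """R buckets, each holding three Cauchy counters (Li's estimator, k=3)."""

    def __init__(self, num_buckets: int, seed: int = 0) -> None:
        self.R = num_buckets
        self.seed = seed
        self.counters: List[List[float]] = [[0.0] * 3 for _ in range(num_buckets)]

    def _bucket(self, i: int) -> int:
        # Stand-in for the hash h: [n] -> [R]; reproducible per index.
        return random.Random(f"{self.seed}-h-{i}").randrange(self.R)

    def _cauchy(self, i: int) -> List[float]:
        # Stand-in for A_i[1..3]; regenerated from the index on demand.
        rng = random.Random(f"{self.seed}-A-{i}")
        return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(3)]

    def update(self, i: int, v: float) -> None:
        """Process the stream update x_i <- x_i + v in O(1) counter updates."""
        bucket = self.counters[self._bucket(i)]
        for j, a in enumerate(self._cauchy(i)):
            bucket[j] += a * v

    def estimate_tail(self, heavy_indices: Iterable[int]) -> float:
        """Sum the estimates of buckets free of heavy indices, scaled by R/|I|."""
        excluded = {self._bucket(i) for i in heavy_indices}
        kept = [b for idx, b in enumerate(self.counters) if idx not in excluded]
        if not kept:
            raise ValueError("every bucket contains a heavy coordinate")
        c3 = math.cos(math.pi / 6) ** 3
        total = sum(c3 * abs(b[0] * b[1] * b[2]) ** (1.0 / 3.0) for b in kept)
        return self.R / len(kept) * total
```

Each update touches a single bucket, so the update cost does not depend on the number of buckets; the scaling by R/|I| compensates for the light coordinates that happen to share a bucket with a heavy coordinate and are therefore discarded.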
The Filter Data Structure: Estimating the Contribution from Heavy Coordinates
In this section, it is assumed that a subset L⊂[n] of indices i is known so that (1) for all i for which |xi|≧∈²∥x∥1, i∈L, and (2) for all i∈L, |xi|≧(∈²/2)∥x∥1. Note this implies |L|≦2/∈². Furthermore, it is also assumed that sign(xi) is known for each i∈L. Throughout this discussion, xhead denotes the vector x projected onto coordinates in L, so that Σi∈L|xi|=∥xhead∥1. The culmination of this section is Theorem 3, which shows that an estimate Φ=∥xhead∥1±∈∥x∥1 can be obtained in small space with large probability via a subroutine referred to herein as Filter. The following uniform hash family construction is used.
THEOREM 1. Let S⊂U=[u] be a set of z>1 elements, and let V=[v], with 1<v≦u.
Suppose the machine word size is Ω(log(u)). For any constant c>0 there is a word RAM algorithm that, using time log(z) logO(1)(v) and O(log(z)+log log(u)) bits of space, selects a family of functions from U to V (independent of S) such that:
The BasicFilter data structure can be defined as follows. Choose a random sign vector σ∈{−1, 1}^n from a 4-wise independent family. Put r=⌈27/∈²⌉. A hash function h:[n]→[r] is chosen at random from a family constructed randomly as in Theorem 1 with u=n, v=z=r, c=1. Note |L|+1<z. Also, r counters b1, . . . , br are initialized to 0. Given an update of the form (i, v), add σ(i)·v to bh(i).
The Filter data structure is defined as follows. Initialize s=⌈log_3(1/∈²)⌉+3 independent copies of the BasicFilter data structure. Given an update (i, v), perform the update described above to each of the copies of BasicFilter. This data structure can be thought of as an s×r matrix of counters Di,j, i∈[s] and j∈[r]. The variable σ^i denotes the sign vector σ in the i-th independent instantiation of BasicFilter, and h^i is defined similarly. Notice that the space complexity of Filter is O(∈⁻² log(1/∈)log(mM)+log(1/∈)log log n), where O represents a constant C that is independent of n. The update time is O(log(1/∈)).
For each w∈L for which h^i(w)=j, say a count Di,j is good for w if for all y∈L\{w}, h^i(y)≠j. Since h^i is |L|-wise independent when restricted to L with probability at least 1−1/r, Pr[Di,j is good for w]≧(1−1/r)·(1−(|L|−1)/r)≧⅔, where the second inequality holds for ∈≦1. It follows that since Filter is the concatenation of s independent copies of BasicFilter,
Let ℰ be the event of EQ. (1).
The following estimator Φ of ∥xhead∥1 is defined given the data in the Filter structure, together with the list L. It is also assumed that ℰ holds, else the estimator is not well-defined. For each w∈L, let i(w) be the smallest i for which D_{i,h^i(w)} is good for w, and let j(w)=h^{i(w)}(w). The estimator is
$\Phi = \sum_{w \in L} \mathrm{sign}(x_w) \cdot \sigma^{i(w)}(w) \cdot D_{i(w), j(w)}$
with σ being a random vector, each of whose entries is either +1 or −1. Note that the Filter data structure resembles the CountSketch structure, but with universal hashing replaced by uniform hashing and with a different estimation procedure.
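A minimal sketch of the Filter structure and its estimator is given below. It makes simplifying assumptions that are not part of the disclosure: the hash functions h^i and sign vectors σ^i are simulated with Python's seeded `random.Random` (fully random, rather than the Theorem 1 family and a 4-wise independent family), indices are arbitrary integers, and the class and method names are hypothetical.

```python
import math
import random
from typing import Dict, List


class Filter:
    """s independent BasicFilter rows of r signed counters each."""

    def __init__(self, eps: float, seed: int = 0) -> None:
        self.r = math.ceil(27 / eps ** 2)
        self.s = math.ceil(math.log(1 / eps ** 2, 3)) + 3
        self.seed = seed
        self.D: List[List[int]] = [[0] * self.r for _ in range(self.s)]

    def _h(self, row: int, i: int) -> int:
        # Stand-in for the hash function of the given row.
        return random.Random(f"{self.seed}-h-{row}-{i}").randrange(self.r)

    def _sigma(self, row: int, i: int) -> int:
        # Stand-in for the sign vector of the given row.
        return random.Random(f"{self.seed}-s-{row}-{i}").choice((-1, 1))

    def update(self, i: int, v: int) -> None:
        """Add sigma(i) * v to counter h(i) in every row."""
        for row in range(self.s):
            self.D[row][self._h(row, i)] += self._sigma(row, i) * v

    def estimate_head(self, signs: Dict[int, int]) -> int:
        """Return Phi, given sign(x_w) for every index w in the list L."""
        phi = 0
        for w, sign_w in signs.items():
            for row in range(self.s):
                j = self._h(row, w)
                # The counter is good for w if no other index of L shares it.
                if all(self._h(row, y) != j for y in signs if y != w):
                    phi += sign_w * self._sigma(row, w) * self.D[row][j]
                    break
            else:
                raise RuntimeError(f"no good counter for index {w}")
        return phi
```

When only the coordinates in L are non-zero, every good counter holds exactly σ(w)·xw, so Φ equals ∥xhead∥1 exactly; light coordinates sharing a counter contribute signed noise whose effect is bounded by Lemma 2.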
LEMMA 2: E[Φ|ℰ]=∥xhead∥1 and Var[Φ|ℰ]≦2∈²∥x∥1²/9.
Proof: By linearity of expectation,
Fix a w∈L, and for notational convenience let i=i(w) and j=j(w). For each y∈[n], set Γ(y)=1 if h^i(y)=j, and set Γ(y)=0 otherwise. Then
Consider any fixing of h^i subject to the occurrence of ℰ, and notice that σ^i is independent of h^i. Since σ^i is 4-wise independent, it follows that
and hence
A bound is now derived for Var[Φ|ℰ]=E[Φ²|ℰ]−E²[Φ|ℰ], or equivalently, the function shown in
Now consider a coordinate y∉L . For S⊂[n] let si be the event that i is |S|-wise independent when restricted to S. By Bayes' rule Pr[L∪{y}i|] is equal to that shown in
Hence,
$E[D_{i,j}^2 \mid \mathcal{E}] \le x_w^2 + 3\epsilon^2 \|x\|_1^2 / r \le x_w^2 + \epsilon^4 \|x\|_1^2 / 9$
As |L|≦2∈−2, it follows that
Now turning to bounding
Fix distinct w, y∈L. Note that (i(w), j(w))≠(i(y), j(y)) conditioned on ℰ occurring. Suppose first that i(w)≠i(y), then the equality shown in
Now suppose that i(w)=i(y). Let i=i(w)=i(y) for notational convenience. Define the indicator random variable Γw(z)=1 if h^i(z)=j(w), and similarly let Γy(z)=1 if h^i(z)=j(y). Then the expression E[sign(xw)sign(xy)σ^{i(w)}(w)σ^{i(w)}(y)D_{i(w),j(w)}D_{i(y),j(y)}|ℰ] can be expanded using the definitions of D_{i(w),j(w)} and D_{i(y),j(y)} as:
The variables z and z′ are fixed and a summand of the form E[sign(xw)sign(xy)xzxz′Γw(z)Γy(z′)×σ^i(z)σ^i(z′)σ^i(w)σ^i(y)|ℰ] is analyzed.
Consider any fixing of h^i subject to the occurrence of ℰ, and recall that σ^i is independent of h^i. Since σ^i is 4-wise independent and a sign vector, it follows that this summand vanishes unless {z, z′}={w, y}. Moreover, since Γw(y)=Γy(w)=0, while Γw(w)=Γy(y)=1, it must be that z=w and z′=y. In this case, E[sign(xw)sign(xy)xzxz′Γw(z)Γy(z′)×σ^i(z)σ^i(z′)σ^i(w)σ^i(y)|h^i]=|xw|·|xy|.
Hence, the total contribution of all distinct w, y∈L to Var[Φ|ℰ] is at most Σ_{w≠y∈L}|xw|·|xy|.
Combining the bounds, it follows that the equalities in
By Chebyshev's inequality, Lemma 2 implies
and thus
The above findings are summarized with the following theorem:
THEOREM 3: Suppose that there is a set L⊂[n] of indices j so that (1) for all j for which |xj|≧∈²∥x∥1, j∈L and (2) for all j∈L, |xj|≧(∈²/2)∥x∥1. Further, suppose sign(xj) is known for each j∈L. Then, there is a 1-pass algorithm, Filter, which outputs an estimate Φ for which, with probability at least 7/10, |Φ−∥xhead∥1|≦∈∥x∥1. The space complexity of the algorithm is O(∈⁻² log(1/∈)log(mM)+log(1/∈)log log n). The update time is O(log(1/∈)), and the reporting time is O(∈⁻² log(1/∈)).
The Final Algorithm
The final algorithm for (1±∈)-approximating ∥x∥1, which was outlined above is now analyzed. The full details of the algorithm are shown in
Definition 4: Let 0<γ<φ and δ>0 be given. In the ℓ1 heavy coordinates problem, with probability at least 1−δ a list L⊂[n] is outputted such that:
Note that for γ≦φ/2, the last two items above imply sign(xi) can be determined for i∈L. For a generic algorithm solving the ℓ1 heavy coordinates problem, HHUpdate(φ), HHReport(φ), and HHSpace(φ) are used to denote update time, reporting time, and space, respectively, with parameter φ and γ=φ/2, δ=1/20.
There exist a few solutions to the ℓ1 heavy coordinates problem in the turnstile model. One prior work gives an algorithm with HHSpace(φ)=O(φ⁻¹ log(mM)log(n)), HHUpdate(φ)=O(log(n)), and HHReport(φ)=O(n log(n)), and another gives an algorithm with HHSpace(φ)=O(φ⁻¹ log(φn) log log(φn) log(1/φ) log(mM)), HHUpdate(φ)=O(log(φn) log log(n) log(1/φ)), and HHReport(φ)=O(φ⁻¹ log(φn) log log(φn) log(1/φ)).
Also, the following theorem follows from Lemma 2.2 (with k=3 in their notation). In Theorem 5 (and in
THEOREM 5: For an integer n>0, let A1[j], . . . , An[j] be 3n independent Cauchy random variables for j=1, 2, 3. Let x∈ℝ^n be arbitrary. Then given Cj=Σi=1nAi[j]·xi for j=1, 2, 3, the estimator
satisfies the following two properties:
It is shown in Theorem 6 that the algorithm outputs (1±O(∈))∥x∥1 with probability at least 3/5. Note this error term can be made ∈ by running the algorithm with ∈′ being ∈ times a sufficiently small constant. Also, the success probability can be boosted to 1−δ by running O(log(1/δ)) instantiations of the algorithm in parallel and returning the median output across all instantiations.
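The median-based amplification mentioned above can be sketched generically. In this illustration the estimator is any zero-argument callable that returns one independent estimate per call; the helper name `median_boost` is hypothetical, and the number of repetitions (O(log(1/δ)) in the text) is left to the caller because the hidden constant is not specified here.

```python
import statistics
from typing import Callable


def median_boost(estimator: Callable[[], float], repetitions: int) -> float:
    """Run independent instantiations and return the median estimate.

    If each run is accurate with probability bounded above 1/2, the median
    is inaccurate only when at least half of the runs fail.
    """
    if repetitions < 1:
        raise ValueError("at least one repetition is required")
    return statistics.median(estimator() for _ in range(repetitions))
```

The median is used rather than the mean because a single badly wrong run, which is possible with heavy-tailed Cauchy-based estimates, can shift the mean arbitrarily but cannot move the median past the accurate majority.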
THEOREM 6: The algorithm of
PROOF: Throughout this proof A is used to denote the 3n-tuple (A1[1], . . . , An[1], . . . , A1[3], . . . , An[3]), and for S⊂[n], ℱS is the event that the hash family that is randomly selected in Step 3 via Theorem 1 is |S|-wise independent when restricted to S. For an event ℰ, 1ℰ denotes the indicator random variable for ℰ. The variable xhead is used to denote x projected onto the coordinates in L, and xtail is used to denote the remaining coordinates. Note ∥x∥1=∥xhead∥1+∥xtail∥1.
The following lemma will now be proved. The proof requires some care since h is not always a uniform hash function on small sets, but is only so on any particular (small) set with large probability.
LEMMA 7: Conditioned on the randomness of HH of
PROOF: For ρ=1−Pr(ℱL), see
The above expectation is now computed conditioned on I. Let ℰI′ be the event I=I′ for an arbitrary I′. Then, see
Note Pr[¬ℱL∪{i}|ℱL]≦Pr[¬ℱL∪{i}]/Pr[ℱL]=ρ′i/(1−ρ) for ρ′i=1−Pr[ℱL∪{i}]. Also, Pr[ℱL∪{i}|ℱL]≧Pr[ℱL∪{i}] since Pr[ℱL∪{i}] is a weighted average of Pr[ℱL∪{i}|ℱL] and Pr[ℱL∪{i}|¬ℱL], and the latter is 0. Thus for some ρ″i∈[0, ρ′i], EQ. (4) is
By the setting of c=2 when picking the hash family of Theorem 1 in Step 3, ρ, ρ′i, ρ″i=O(∈³) for all i, and thus ρ′i/(1−ρ)·R=O(∈), implying the above is (1±O(∈))∥xtail∥1. Plugging this into EQ. 3 then shows that the desired expectation is (1±O(∈))∥xtail∥1.
The expected variance of (R/|I|)·Σj∈I L̃1(j) is now bounded.
LEMMA 8: Conditioned on HH being correct,
PROOF: For any fixed h, R/|I| is determined and the L̃1(j) are pairwise independent. Thus for fixed h,
First observe that since |I|≧R−|L|≧2/∈², for any choice of h, R/|I|≦2. Thus, up to a constant factor, the expectation to be computed is
For notational convenience, let L̃1(j)=0 if j∉I. Now see
An essentially identical calculation, but conditioning on the independence event for L∪{i,i′} instead of that for L∪{i}, gives that Prh[(h(i)=j)∧(h(i′)=j)|j∈I]=O(1/R²). Combining these bounds with EQ. 5, the expected variance to be computed is O(∥xtail∥2²+∥xtail∥1²/R).
The second summand is O(∈²∥x∥1²). For the first summand, conditioned on HH being correct, every |xi| for i∉L has |xi|≦∈²∥x∥1. Under this constraint, ∥xtail∥2² is maximized when there are exactly 1/∈² coordinates i∉L each with |xi|=∈²∥x∥1, in which case ∥xtail∥2²=∈²∥x∥1².
The proof of correctness of the full algorithm shown in
Next, the quantity
is looked at.
By Lemma 7, E[X], even conditioned on the randomness used by HH to determine L, is (1±O(∈))∥xtail∥1. Also, conditioned on HH being correct, the expected value of Var[X] for a random h is O(∈²∥x∥1²). Since Var[X] is always non-negative, Markov's bound applies and Var[X]=O(∈²∥x∥1²) with probability at least 19/20 (over the randomness in selecting h).
Thus, by Chebyshev's inequality,
which can be made at most 1/15 by setting t to a sufficiently large constant. Then, as long as HH and F succeed and the deviation bounded in EQ. 6 does not occur, the final estimate of ∥x∥1 is (1±O(∈))∥xtail∥1+∥xhead∥1±O(∈∥x∥1)=(1±O(∈))∥x∥1, as desired. The probability of correctness is then at least that shown in
Remark 9: It is known from previous work, that each Ai[j] can be maintained up to only O(log(n/∈)) bits of precision, and requires the same amount of randomness to generate, to preserve the probability of correctness to within an arbitrarily small constant. Then, note that the counters Bi[j] each only consume O(log(nmM/∈)) bits of storage.
Given Remark 9, the following theorem is given.
Theorem 10: Ignoring the space to store the Ai[j], the overall space required for the algorithm of
PROOF: Ignoring F and HH, the update time is O(1) to compute h and O(1) to update the corresponding B_{h(i)}. Also ignoring F and HH, the space required is O(ε^{-2} log(nmM/ε)) bits to store all the B_i[j] (Remark 9), and O(ε^{-2} log(1/ε)+log log(n)) bits to store h and randomly select the hash family it comes from (Theorem 1). The time to compute the final line in the estimator, given L and ignoring the time to compute Φ, is O(1/ε). The bounds stated above then take into account the complexities of F and HH.
Derandomizing the Final Algorithm
Observe that a naive implementation of storing the entire tuple A in
First, recall the definition of a finite state machine (FSM). An FSM M is parameterized by a tuple (T_init, S, Γ, n). The FSM M is always in some “state”, which is just a string x∈{0,1}^S, and it starts in the state T_init. The parameter Γ is a function mapping {0,1}^S×{0,1}^n→{0,1}^S. By a slight abuse of notation, for x∈({0,1}^n)^r with r a positive integer, Γ(T, x) denotes Γ( . . . (Γ(Γ(T, x_1), x_2), . . . ), x_r). Note that given a distribution D over ({0,1}^n)^r, there is an implied distribution M(D) over {0,1}^S obtained as Γ(T_init, D).
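The iterated transition Γ(T, x) is a left fold of Γ over the blocks of x, and M(D) is the distribution of the resulting state. The following minimal sketch illustrates both for a toy machine. The machine itself (a 3-bit counter that adds each 2-bit block modulo 8) and all function names are hypothetical, chosen only for illustration, and states and blocks are represented as integers rather than bit strings.

```python
from collections import Counter
from fractions import Fraction
from functools import reduce
from itertools import product

def run_fsm(transition, t_init, blocks):
    """Compute Γ(T_init, x) for x = (x_1, ..., x_r) by folding Γ over the blocks."""
    return reduce(transition, blocks, t_init)

def implied_distribution(transition, t_init, n_bits, r):
    """M(D) for D uniform over ({0,1}^n)^r, obtained by brute-force enumeration."""
    counts = Counter(
        run_fsm(transition, t_init, blocks)
        for blocks in product(range(1 << n_bits), repeat=r)
    )
    total = (1 << n_bits) ** r
    return {state: Fraction(count, total) for state, count in counts.items()}

S_BITS = 3  # hypothetical toy machine with S = 3 state bits

def add_mod(state, block):
    """Toy transition function: add the input block to the state modulo 2^S."""
    return (state + block) % (1 << S_BITS)
```

Brute-force enumeration is only feasible for tiny n and r; it is used here purely to make the definition of M(D) concrete.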
DEFINITION 11: Let t be a positive integer. For D, D′ two distributions on {0,1}^t, the total variation distance Δ(D, D′) is defined by
Δ(D, D′) = max_{T⊆{0,1}^t} |Pr_{X←D}[X∈T] − Pr_{Y←D′}[Y∈T]|.
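As an illustration, the total variation distance between two small explicit distributions can be computed in two equivalent ways: as half the sum of absolute differences of the probabilities, or as the gap achieved by the best distinguishing set T. The sketch below assumes the standard definition of total variation distance and represents each distribution as a dictionary from outcomes to exact probabilities; the function names are illustrative.

```python
from fractions import Fraction

def tv_half_l1(p, q):
    """Total variation distance as half the L1 distance between probability vectors."""
    support = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in support) / 2

def tv_best_set(p, q):
    """Same distance via the maximizing set T = {outcomes where p exceeds q}."""
    support = set(p) | set(q)
    best = {k for k in support if p.get(k, 0) > q.get(k, 0)}
    return sum(p.get(k, 0) for k in best) - sum(q.get(k, 0) for k in best)
```

The second form is the one used when arguing about the probability that an algorithm's final state lands in a set of "correct" states.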
THEOREM 12: Let U_t denote the uniform distribution on {0,1}^t. For any positive integers r, n, and for some S=Θ(n), there exists a function G_nisan: {0,1}^s→({0,1}^n)^r with s=O(S log(r)) such that for any FSM M=(T_init, S, Γ, n), Δ(M((U_n)^r), M(G_nisan(U_s)))≤2^{-S}.
Furthermore, for any x∈{0,1}^s and i∈[r], computing the n-bit block G_nisan(x)_i requires O(S log(r)) space and O(log(r)) arithmetic operations on O(S)-bit words.
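The per-block cost can be seen from the recursive structure of Nisan's generator, in which the output at level k is the output at level k−1 on the start block followed by the output at level k−1 on a hashed copy of the start block. The following is a simplified sketch of that recursion and is not the construction as specified for the embodiments: as a labeled assumption, blocks are integers modulo a Mersenne prime and each level uses an affine hash (a·x+b) mod P, whereas the actual generator operates on exact n-bit strings. All names are illustrative.

```python
import random

P = (1 << 61) - 1  # assumption: blocks live in Z_P rather than exact n-bit strings

def make_seed(levels, rng):
    """Seed = one start block plus one affine hash (a, b) per level."""
    start = rng.randrange(P)
    hashes = [(rng.randrange(P), rng.randrange(P)) for _ in range(levels)]
    return start, hashes

def block(seed, i):
    """Return output block i using at most one hash evaluation per level."""
    x, hashes = seed
    for level in range(len(hashes) - 1, -1, -1):
        if (i >> level) & 1:  # bit set: descend into the hashed half
            a, b = hashes[level]
            x = (a * x + b) % P
    return x

def expand(x, hashes):
    """Reference recursion: G_k(x) = G_{k-1}(x) followed by G_{k-1}(h_k(x))."""
    if not hashes:
        return [x]
    *lower, (a, b) = hashes
    return expand(x, lower) + expand((a * x + b) % P, lower)
```

With r = 2^levels output blocks, `block` touches one hash per bit of i, which mirrors the O(log(r)) arithmetic operations stated above, and the seed holds 2·levels+1 words, which mirrors the O(S log(r)) seed length.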
Before finally describing how Theorem 12 fits into a de-randomization of
LEMMA 13: If X_1, . . . , X_m are independent and Y_1, . . . , Y_m are independent, then
Δ(X_1× . . . ×X_m, Y_1× . . . ×Y_m) ≤ Σ_{i=1}^{m} Δ(X_i, Y_i).
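The standard form of this kind of bound states that the total variation distance between two product distributions is at most the sum of the coordinate-wise distances. The sketch below checks that inequality exactly on a small example; the helper names and the example distributions are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def tv(p, q):
    """Total variation distance between two finite distributions given as dicts."""
    support = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in support) / 2

def product_dist(dists):
    """Joint distribution of independent coordinates with the given marginals."""
    joint = {}
    for combo in product(*(d.items() for d in dists)):
        outcome = tuple(key for key, _ in combo)
        prob = Fraction(1)
        for _, weight in combo:
            prob *= weight
        joint[outcome] = prob
    return joint

def sum_of_marginal_tv(xs, ys):
    """Right-hand side of the bound: sum of coordinate-wise distances."""
    return sum(tv(x, y) for x, y in zip(xs, ys))
```

For two fair coins compared against coins with biases 3/4 and 1/4, the product distance is 5/16 while the sum of the marginal distances is 1/2, so the bound holds with room to spare.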
Now, the derandomization of
By Theorem 12, if rather than defining A by 3tr truly random bits (for r=n) it is defined instead by stretching a seed of length s=O(S log(n))=O(log(nmM/ε) log(n)) via G_nisan, then the distribution on the state of B_u at the end of the stream changes by at most a total variation distance of 2^{-S}. Now, suppose R independent seeds are used to generate different A vectors in each of the R buckets. Note that since each index i∈[n] is hashed to exactly one bucket, the A_i[j] across each bucket need not be consistent to preserve the behavior of the algorithm. Then, for U_t being the uniform distribution on {0,1}^t,
Δ(M_1((U_{3t})^r)× . . . ×M_R((U_{3t})^r), M_1(G_nisan(U_s))× . . . ×M_R(G_nisan(U_s))) ≤ R·2^{-S}.
By increasing S by a constant factor, R·2^{-S} can be made an arbitrarily small constant δ. Now, note that the product measure on the output distributions of the M_u corresponds exactly to the state of the entire algorithm at the end of the stream. Thus, if one considers T to be the set of states (B_1, . . . , B_R) for which the algorithm outputs a value (1±ε)∥x∥_1 (i.e., is correct), then by the definition of total variation distance (Definition 11), the probability of correctness of the algorithm changes by at most an additive δ when using Nisan's PRG instead of uniform randomness. Noting that storing R independent seeds takes just Rs space, and that extracting any A_i[j] from a seed requires O(log(n)) time by Theorem 12, the following theorem is obtained.
THEOREM 14: Including the space and time complexities of storing and accessing the Ai[j], the algorithm of
Therefore, as can be seen from the above discussion, one or more embodiments provide a 1-pass algorithm using ε^{-2} polylog(nmM) space for ℓ1-estimation in data streams with polylog(nmM) update time and reporting time ε^{-2} polylog(nmM). This algorithm is the first to be simultaneously optimal in both the space and the update time up to polylog(nmM) factors. Conventional algorithms either required at least ε^{-3} polylog(nmM) bits of space or at least ε^{-2} update time. As ε can be arbitrarily small, the result of one or more embodiments can provide a substantial benefit over conventional algorithms. In light of known lower bounds, the space and time complexity of these one or more embodiments are optimal up to polylog(nmM) factors.
Operational Flow
Aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Also, aspects of the present invention have been discussed above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include a computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. A computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiments above were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.