Aggregate contribution of iceberg queries

Information

  • Patent Grant
  • 8495087
  • Patent Number
    8,495,087
  • Date Filed
    Tuesday, February 22, 2011
  • Date Issued
    Tuesday, July 23, 2013
Abstract
One or more embodiments determine a distance between at least two vectors of n coordinates. A set of heavy coordinates is identified from a set of n coordinates associated with at least two vectors. A set of light coordinates is identified from the set of n coordinates associated with the at least two vectors. A first estimation of a contribution is determined from the set of heavy coordinates to a rectilinear distance between the at least two vectors. A second estimation of a contribution is determined from the set of light coordinates to the rectilinear distance norm. The first estimation is combined with the second estimation.
Description
BACKGROUND

The present invention generally relates to data streams, and more particularly relates to measuring distance between data in a data stream.


Recent years have witnessed an explosive growth in the amount of available data. Data stream algorithms have become a quintessential tool for analyzing such data. These algorithms have found diverse applications, such as large scale data processing and data warehousing, machine learning, network monitoring, and sensor networks and compressed sensing. A key ingredient in all these applications is a distance measure between data. In nearest neighbor applications, a database of points is compared to a query point to find the nearest match. In clustering, classification, and kernels, e.g., those used for support vector machines (SVM), given a matrix of points, all pairwise distances between the points are computed. In network traffic analysis and denial of service detection, global flow statistics computed using Net-Flow software are compared at different times via a distance metric. Seemingly unrelated applications, such as the ability to sample an item in a tabular database proportional to its weight, i.e., to sample from the forward distribution, or to sample from the output of a SQL Join, require a distance estimation primitive for proper functionality.


One of the most robust measures of distance is the ℓ1-distance (rectilinear distance), also known as the Manhattan or taxicab distance. The main reason this distance is robust is that it is less sensitive to outliers. Given vectors x, y∈ℝn, the ℓ1-distance is defined as

∥x−y∥1=Σi=1n|xi−yi|.

This measure, which also equals twice the total variation distance, is often used in statistical applications for comparing empirical distributions, for which it is more meaningful and natural than Euclidean distance. The ℓ1-distance also has a natural interpretation for comparing multisets, whereas Euclidean distance does not. Other applications of ℓ1 include clustering, regression (with applications to time sequences), Internet-traffic monitoring, and similarity search. In the context of certain nearest-neighbor search problems, “the Manhattan distance metric is consistently more preferable than the Euclidean distance metric for high dimensional data mining applications”. The ℓ1-distance may also support faster indexing for similarity search.
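The definition above translates directly into code; a minimal Python sketch (the function name is illustrative, not from the patent):

```python
def l1_distance(x, y):
    """Rectilinear (Manhattan / taxicab) distance: sum of |x_i - y_i|."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of coordinates")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))
```

For example, for x=(1, 2, 3) and y=(4, 0, 3) the distance is |1−4|+|2−0|+|3−3|=5.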


Another application is with respect to estimating cascaded norms of a tabular database, i.e., the ℓp norm on a list of attributes of a record is first computed, then these values are summed over records. This problem is known as ℓ1(ℓp) estimation. An example application is in the processing of financial data. In a stock market, changes in stock prices are recorded continuously using a quantity known as the logarithmic return on investment (rlog). To compute the average historical volatility of the stock market from the data, the data is segmented by stock, the variance of the rlog values is computed for each stock, and then these variances are averaged over all stocks. This corresponds to an ℓ1(ℓ2) computation (normalized by a constant). As a subroutine for computing ℓ1(ℓ2), the best known algorithms use a routine for ℓ1-estimation.
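The cascaded ℓ1(ℓ2) quantity can be computed exactly offline; a short illustrative sketch (this computes the unnormalized cascaded norm, whereas the streaming algorithms in question only approximate it):

```python
import math

def cascaded_l1_l2(records):
    """l1(l2) of a table: the l2 norm over each record's attributes,
    summed over all records."""
    return sum(math.sqrt(sum(v * v for v in attrs)) for attrs in records)
```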


BRIEF SUMMARY

In one embodiment, a method for determining a distance between at least two vectors of n coordinates is disclosed. The method comprises identifying a set of heavy coordinates from a set of n coordinates associated with at least two vectors. A heavy coordinate is represented as |xi|≧∈2∥x∥1, where x is a vector, i is a coordinate in the set of n coordinates, and ∈ is an arbitrary number. A set of light coordinates is identified from the set of n coordinates associated with the at least two vectors, wherein a light coordinate is represented as |xi|<∈2∥x∥1. A first estimation of a contribution is determined from the set of heavy coordinates to a rectilinear distance between the at least two vectors. A second estimation of a contribution is determined from the set of light coordinates to the rectilinear distance norm. The first estimation is combined with the second estimation.


In another embodiment, an information processing system for determining a distance between at least two vectors of n coordinates is disclosed. The information processing system comprises a memory and a processor that is communicatively coupled to the memory. A data stream analyzer is communicatively coupled to the processor and the memory. The data stream analyzer is configured to perform a method. The method comprises identifying a set of heavy coordinates from a set of n coordinates associated with at least two vectors. A heavy coordinate is represented as |xi|≧∈2∥x∥1, where x is a vector, i is a coordinate in the set of n coordinates, and ∈ is an arbitrary number. A set of light coordinates is identified from the set of n coordinates associated with the at least two vectors, wherein a light coordinate is represented as |xi|<∈2∥x∥1. A first estimation of a contribution is determined from the set of heavy coordinates to a rectilinear distance between the at least two vectors. A second estimation of a contribution is determined from the set of light coordinates to the rectilinear distance norm. The first estimation is combined with the second estimation.


In yet another embodiment, a computer program product for determining a distance between at least two vectors of n coordinates is disclosed. The computer program product comprises a storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method. The method comprises identifying a set of heavy coordinates from a set of n coordinates associated with at least two vectors. A heavy coordinate is represented as |xi|≧∈2∥x∥1, where x is a vector, i is a coordinate in the set of n coordinates, and ∈ is an arbitrary number. A set of light coordinates is identified from the set of n coordinates associated with the at least two vectors, wherein a light coordinate is represented as |xi|<∈2∥x∥1. A first estimation of a contribution is determined from the set of heavy coordinates to a rectilinear distance between the at least two vectors. A second estimation of a contribution is determined from the set of light coordinates to the rectilinear distance norm. The first estimation is combined with the second estimation.





BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The accompanying figures where like reference numerals refer to identical or functionally similar elements throughout the separate views, and which together with the detailed description below are incorporated in and form part of the specification, serve to further illustrate various embodiments and to explain various principles and advantages all in accordance with the present invention, in which:



FIG. 1 is a block diagram illustrating one example of an operating environment comprising a data stream analyzer according to one embodiment of the present invention;



FIG. 2 shows a bounding of Var[Φ|ℰ]=E[Φ2|ℰ]−E2[Φ|ℰ] according to one embodiment of the present invention;



FIG. 3 shows an equality for E[Di,j2|ℰ] according to one embodiment of the present invention;



FIG. 4 shows an equality for Pr[ℰL∪{y}i|ℰ] according to one embodiment of the present invention;



FIG. 5 shows an equality for E[sign(xw)sign(xy)σi(w)(w)σi(y)(y)Di(w),j(w)Di(y),j(y)|ℰ] according to one embodiment of the present invention;



FIG. 6 shows another equality when a set of bounds are combined according to one embodiment of the present invention;



FIG. 7 shows one example of pseudocode of an ℓ1-estimation algorithm according to one embodiment of the present invention;



FIG. 8 shows a proof for Lemma 7 according to one embodiment of the present invention;



FIG. 9 shows an equality for EA,h[RI·Σj∈I L̃1(j)|ℰL, I] according to one embodiment of the present invention;



FIG. 10 shows an equality for Prh[(h(i)=j)∧ℰl|ℰL] according to one embodiment of the present invention;



FIG. 11 shows another equality according to one embodiment of the present invention;



FIG. 12 shows an equality for Prh[h(i)=j|j∈I] according to one embodiment of the present invention;



FIG. 13 shows yet another equality based on Bayes' theorem according to one embodiment of the present invention;



FIG. 14 shows a probability of correctness according to one embodiment of the present invention; and



FIG. 15 is an operational flow diagram illustrating one example of a process for determining a distance between at least two vectors of n coordinates according to one embodiment of the present invention.





DETAILED DESCRIPTION
Operating Environment


FIG. 1 shows one example of an operating environment 100 applicable to various embodiments of the present invention. In particular, FIG. 1 shows a computer system/server 102 that is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with computer system/server 102 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices, and the like. Computer system/server 102 may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types.


As shown in FIG. 1, a computer system/server 102 is shown in the form of a general-purpose computing device. The components of computer system/server 102 can include, but are not limited to, one or more processors or processing units 104, a system memory 106, and a bus 108 that couples various system components including system memory 106 to processor 104. Bus 108 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnects (PCI) bus.


Computer system/server 102 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 102, and it includes both volatile and non-volatile media, removable and non-removable media. System memory 106, in one embodiment, comprises a data stream analyzer 110 that performs one or more of the embodiments discussed below with respect to measuring distance between data. It should be noted that the data stream analyzer 110 can also be implemented in hardware as well. The system memory 106 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 112 and/or cache memory 114.


Computer system/server 102 can further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 116 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive”). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a “floppy disk”), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 108 by one or more data media interfaces. As will be further depicted and described below, memory 106 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.


Program/utility 118, having a set (at least one) of program modules 120, may be stored in memory 106 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 120 generally carry out the functions and/or methodologies of embodiments of the invention as described herein.


Computer system/server 102 may also communicate with one or more external devices 122 such as a keyboard, a pointing device, a display 124, etc.; one or more devices that enable a user to interact with computer system/server 102; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 102 to communicate with one or more other computing devices. Such communication can occur via I/O interfaces 126. Still yet, computer system/server 102 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet) via network adapter 128. As depicted, network adapter 128 communicates with the other components of computer system/server 102 via bus 108. It should be understood that although not shown, other hardware and/or software components could be used in conjunction with computer system/server 102. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems, etc.


Overview


The inventors' paper entitled “Fast Manhattan Sketches in Data Streams,” by Jelani Nelson and David P. Woodruff, ACM PODS '10, Indianapolis, Ind., USA, is hereby incorporated by reference in its entirety. As discussed above, the ℓ1-distance, also known as the Manhattan or taxicab distance, between two vectors x, y in ℝn is Σi=1n|xi−yi|. Approximating this distance is a fundamental primitive on massive databases, with applications to clustering, nearest neighbor search, network monitoring, regression, sampling, and support vector machines. One or more embodiments of the present invention are directed to the problem of estimating the ℓ1-distance in the most general turnstile model of data streaming.


Formally, given a total of m updates (positive or negative) to an n-dimensional vector x, one or more embodiments maintain a succinct summary, or sketch, of what has been seen so that at any point in time the data stream analyzer can output an estimate E(x) so that with high probability, (1−∈)∥x∥1≦E(x)≦(1+∈)∥x∥1, where ∈>0 is a tunable approximation parameter. Here, an update has the form (i, v), meaning that the value v should be added to coordinate i. One or more embodiments assume that v is an integer (this is without loss of generality by scaling), and that |v|≦M, where M is a parameter. Updates can be interleaved and presented in an arbitrary order. Of interest is the amount of memory to store the sketch, the amount of time to process a coordinate update, and the amount of time to output an estimate upon request.
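The turnstile update semantics can be sketched as follows; maintaining the explicit vector is shown only to fix the meaning of an update (a real sketch stores far less than the full vector):

```python
def apply_updates(n, updates):
    """Turnstile model: start from the zero vector and apply updates (i, v),
    each meaning x[i] += v, in arbitrary order."""
    x = [0] * n
    for i, v in updates:
        x[i] += v
    return x

def exact_l1(x):
    """The quantity the sketch approximates: ||x||_1."""
    return sum(abs(v) for v in x)
```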


One or more embodiments of the present invention are advantageous because they give the first 1-pass streaming algorithm for this problem in the turnstile model with O*(∈−2) space and O*(1) update time, where the bounds are optimal up to O*(1) factors. The O* notation hides polylogarithmic factors in ∈, n, and the precision required to store vector entries. In particular, one or more embodiments provide a 1-pass algorithm using ∈−2 polylog(nmM) space for ℓ1-estimation in data streams with polylog(nmM) update time, and reporting time ∈−2 polylog(nmM). This algorithm is simultaneously optimal in both the space and the update time up to polylog(nmM) factors. Conventional algorithms required either at least ∈−3 polylog(nmM) bits of space or at least ∈−2 update time. As ∈ can be arbitrarily small, the result of one or more embodiments can provide a substantial benefit over conventional algorithms. In light of known lower bounds, the space and time complexity of these one or more embodiments are optimal up to polylog(nmM) factors.


It should be noted that in the following discussion, for a function ƒ, the notation O*(ƒ) is used to denote a function g=O(ƒ·polylog(nmM/∈)). Θ* and Ω* are similarly defined.


The improvements provided by one or more embodiments of the present invention result in corresponding gains for the aforementioned applications. Examples include the scan for nearest neighbor search: to obtain sketches of size O*(∈−2), these embodiments reduce the preprocessing time from O(nd∈−2) to O*(nd). These embodiments also shave an ∈−2 factor from the time for computing all pairwise ℓ1-distances, from the update time for sampling from the forward distribution, from the time for comparing two collections of traffic-flow summaries, and from the time for estimating cascaded norms.


Techniques


Using the Cauchy sketches of Li (particularly, the geometric mean estimator) would require Ω*(∈−2) update time. Multi-level sketches can be used, incurring an extra Ω*(∈−1) factor in the space. Various embodiments of the present invention achieve O*(1) update time by using Cauchy sketches (and particularly, Li's geometric mean estimator). However, to achieve this result one or more embodiments preprocess and partition the data, as discussed in greater detail below.


A Cauchy sketch is now described. Given a vector x, the sketch is a collection of counters

Yj=Σi=1n xi·Ci,j for j=1, . . . , k,

where the Ci,j are standard Cauchy random variables with probability density function

μ(y)=1/(π(1+y2)).

The Ci,j are generated pseudo-randomly using a pseudo-random generator (PRG). By the 1-stability of the Cauchy distribution, Yj is also distributed as a standard Cauchy random variable, scaled by ∥x∥1. Li shows that there is a constant ck>0 so that for any k≧3, if Y1, . . . , Yk are independent Cauchy sketches, then the geometric mean estimator

EstGM=ck·(|Y1|·|Y2| . . . |Yk|)1/k

has an expected value E[EstGM]=∥x∥1 and a variance of Var[EstGM]=Θ(∥x∥12/k). The space and time complexity of maintaining the Yj in a data stream are O*(k), and by linearity, the sketch can be computed in a single pass. By Chebyshev's inequality, for k=Θ(∈−2) one obtains a (1±∈)-approximation to ∥x∥1 with constant probability, which can be amplified by taking the median of independent repetitions. While the space needed is O*(∈−2), so is the update time.
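A toy rendering of the Cauchy sketch and the geometric mean estimator can be sketched as follows. The scaling constant ck and the PRG are omitted here (assumptions for illustration only), so the estimate is only correct up to the constant ck:

```python
import math
import random

def cauchy_sketch(x, k, seed=0):
    """k counters Y_j = sum_i x_i * C_{i,j}, with C_{i,j} drawn as standard
    Cauchy variables from a seeded RNG (a real implementation uses a PRG)."""
    rng = random.Random(seed)
    y = [0.0] * k
    for xi in x:
        for j in range(k):
            c = math.tan(math.pi * (rng.random() - 0.5))  # standard Cauchy draw
            y[j] += xi * c
    return y

def geometric_mean_estimate(y):
    """Geometric mean of |Y_1|, ..., |Y_k|: Li's estimator, up to the
    omitted constant c_k."""
    k = len(y)
    prod = 1.0
    for yj in y:
        prod *= abs(yj) ** (1.0 / k)
    return prod
```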


The starting point of one or more embodiments is the following idea. Suppose the coordinates are randomly partitioned into O(∈−2) buckets. In each bucket Li's estimator is maintained, but only with parameter k=3. Given an update to a coordinate i, it lands in a unique bucket, and the contents of this bucket can be updated in O*(1) time. Using Θ(∈−2) buckets, the space is also O*(∈−2). One is then faced with the following temptation: letting Gi be the estimate returned by Li's procedure in bucket i for k=3, output






G=Σi=1r Gi.

From the properties of the Gi, this is correct in expectation.


The main wrinkle is that Var[G] can be as large as Ω(∥x∥12), which is not good enough. To see that this can happen, suppose x contains only a single non-zero coordinate x1=1. In the bucket containing x1, the value G of Li's estimator is the geometric mean of 3 standard Cauchy random variables. By the above, Var[G]=Θ(∥x∥12/k)=Θ(∥x∥12).


Note though that in the above example, x1 contributed a large fraction of the ℓ1 mass of x (in fact, all of it). The main idea of one or more embodiments is then the following. A φ-heavy coordinate of the vector x is a coordinate i for which |xi|≧φ·∥x∥1. Algorithms for finding heavy coordinates, also known as iceberg queries, have been extensively studied in the database community, and such algorithms are used in the algorithm of one or more embodiments of the present invention. Set φ=∈2. Every φ-heavy coordinate is removed from x, the contribution of these heavy coordinates is estimated separately, and then the bucketing above is used on the remaining coordinates. This reduces Var[G] to O(∥xtail∥22), where xtail is the vector obtained from x by removing the heavy coordinates. A calculation shows that O(∥xtail∥22)=O(∈2∥x∥12), which is good enough to argue that ∥xtail∥1 can be estimated to within an additive ∈∥x∥1 with constant probability. This idea can be implemented in a single pass.
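Identifying the φ-heavy coordinates is itself done with a streaming iceberg-query algorithm; offline, the definition is simply the following check (an exact illustration, not the streaming algorithm):

```python
def heavy_coordinates(x, phi):
    """Indices i with |x_i| >= phi * ||x||_1 (exact offline check; a
    streaming heavy-hitters algorithm approximates this set)."""
    l1 = sum(abs(v) for v in x)
    return [i for i, v in enumerate(x) if abs(v) >= phi * l1]
```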


The main remaining hurdle is estimating ∥xhead∥1, the contribution to ∥x∥1 from the heavy coordinates. Using current techniques, the CountMin sketch can be used to estimate the value of each ∈2-heavy coordinate up to an additive ∈3∥x∥1. Summing the estimates gives ∥xhead∥1 up to an additive ∈∥x∥1. This, however, requires Ω*(∈−3) space, which, in some embodiments, cannot be afforded. Instead, a new subroutine, Filter, is designed that estimates the sum of the absolute values of the heavy coordinates, i.e., the value ∥xhead∥1, up to an additive ∈∥x∥1, without guaranteeing an accurate frequency estimate for any individual heavy coordinate. This relaxed guarantee is sufficient for correctness of the overall algorithm, and is implementable in O*(∈−2) space.


Other technical complications arise due to the fact that the partitioning is not truly random, nor is the randomness used by Li's estimator. Therefore, one or more embodiments use a family that is close to an O(∈−2)-wise independent family, but doesn't suffer the O(∈−2) evaluation time required of functions in such families (e.g., O(∈−2)-degree polynomial evaluation). These functions can be evaluated in constant time. The caveat is that the correctness analysis needs more attention.


Preliminaries


The algorithm used by the data stream analyzer 110 operates, in one embodiment, in the following model. A vector x of length n is initialized to 0, and it is updated in a stream of m updates from the set [n]×{−M, . . . , M}. An update (i, v) corresponds to the change xi←xi+v. In one embodiment, a (1±∈)-approximation to









x


1

=




i
=
1

n





x
i









is computed for some given parameter ∈>0. All space bounds in this discussion are in bits, and all logarithms are base 2, unless explicitly stated otherwise. Running times are measured as the number of standard machine word operations (integer arithmetic, bitwise operations, and bitshifts). A differentiation is made between update time, which is the time to process a stream update, and reporting time, which is the time required to output an answer. Each machine word is assumed to be Ω(log(nmM/∈)) bits so that index each vector can be indexed and arithmetic can be performed on vector entries and the input approximation parameter in constant time.


Throughout this discussion, for integer z, [z] is used to denote the set {1, . . . , z}. For reals A, B, A±B is used to denote some value in the interval [A−B, A+B]. Whenever a frequency xi is discussed, the frequency at the stream's end is being referred to. It is also assumed that ∥x∥1≠0 without loss of generality (note ∥x∥1=0 iff ∥x∥2=0, and the latter can be detected with arbitrarily large constant probability in O(log(nmM)) space and O(1) update and reporting time by, say, the AMS sketch), and that ∈<∈0 for some fixed constant ∈0.



ℓ1 Streaming Algorithm


The ℓ1 streaming algorithm used by the data stream analyzer 110 for (1±∈)-approximating ∥x∥1 is now discussed in greater detail. As discussed above, the algorithm works by estimating the contributions to ℓ1 from the heavy coordinates and non-heavy coordinates separately, then summing these estimates.


A “φ-heavy coordinate” is an index i such that |xi|≧φ∥x∥1. A known heavy coordinate algorithm for the turnstile model of streaming (the model currently being operated in) is used to identify the ∈2-heavy coordinates. Given this information, a subroutine, Filter (discussed below), is used to estimate the contribution of these heavy coordinates to ℓ1 up to an additive error of ∈∥x∥1. This takes care of the contribution from heavy coordinates. In parallel, R=Θ(1/∈2) “buckets” Bi are maintained, which allow the contribution from non-heavy coordinates to be estimated. Each index in [n] is hashed to exactly one bucket i∈[R]. The ith bucket keeps track of the dot product of x, restricted to those indices hashed to i, with three random Cauchy vectors, and a known unbiased ℓ1 estimator due to Li (the “geometric mean estimator”) is applied to estimate the ℓ1 norm of x restricted to the indices hashed to i. The estimates from the buckets not containing any ∈2-heavy coordinates are then summed up (with some scaling). The value of the summed estimates turns out to be approximately correct in expectation. Then, using the fact that the summed estimates only come from buckets without heavy coordinates, it can be shown that the variance is also fairly small, which then shows that the estimate of the contribution from the non-heavy coordinates is correct up to ∈∥x∥1 with large probability.


The Filter Data Structure: Estimating the Contribution from Heavy Coordinates


In this section, it is assumed that a subset L⊆[n] of indices i is known so that (1) for all i for which |xi|≧∈2∥x∥1, i∈L, and (2) for all i∈L, |xi|≧(∈2/2)∥x∥1. Note this implies |L|≦2/∈2. Furthermore, it is also assumed that sign(xi) is known for each i∈L. Throughout this discussion, xhead denotes the vector x projected onto the coordinates in L, so that Σi∈L|xi|=∥xhead∥1. The culmination of this section is Theorem 3, which shows that an estimate Φ=∥xhead∥1±∈∥x∥1 can be obtained in small space with large probability via a subroutine referred to herein as Filter. The following uniform hash family construction is used.


THEOREM 1. Let S⊆U=[u] be a set of z>1 elements, and let V=[v], with 1<v≦u.


Suppose the machine word size is Ω(log(u)). For any constant c>0 there is a word RAM algorithm that, using time log(z) logO(1)(v) and O(log(z)+log log(u)) bits of space, selects a family ℋ of functions from U to V (independent of S) such that:






    • 1. With probability 1−O(1/zc), ℋ is z-wise independent when restricted to S.

    • 2. Any h∈ℋ can be represented by a RAM data structure using O(z log(v)) bits of space, and h can be evaluated in constant time after an initialization step taking O(z) time.





The BasicFilter data structure can be defined as follows. Choose a random sign vector σ∈{−1, 1}n from a 4-wise independent family. Put r=⌈27/∈2⌉. A hash function h: [n]→[r] is chosen at random from a family ℋ constructed randomly as in Theorem 1 with u=n, v=z=r, c=1. Note |L|+1<z. Also, r counters b1, . . . , br are initialized to 0. Given an update of the form (i, v), add σ(i)·v to bh(i).


The Filter data structure is defined as follows. Initialize s=⌈log3(1/∈2)⌉+3 independent copies of the BasicFilter data structure. Given an update (i, v), perform the update described above in each of the copies of BasicFilter. This data structure can be thought of as an s×r matrix of counters Di,j, i∈[s] and j∈[r]. The variable σi denotes the sign vector σ in the i-th independent instantiation of BasicFilter, and hi and ℋi are defined similarly. Notice that the space complexity of Filter is O(∈−2 log(1/∈)log(mM)+log(1/∈)log log n), where O represents a constant C that is independent of n. The update time is O(log(1/∈)).
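A toy rendering of BasicFilter's update rule follows; the 4-wise independent signs and the Theorem 1 hash family are replaced here by lazily sampled values from a seeded RNG (assumptions for illustration only), and in the actual construction r would be set from ∈ as described above:

```python
import random

class BasicFilter:
    """r counters b_1..b_r; an update (i, v) adds sigma(i) * v to b_{h(i)}."""

    def __init__(self, r, seed=0):
        self.r = r
        self.b = [0] * r
        self._rng = random.Random(seed)
        self._h = {}      # lazily sampled stand-in for the Theorem 1 hash family
        self._sigma = {}  # lazily sampled stand-in for 4-wise independent signs

    def _hash(self, i):
        if i not in self._h:
            self._h[i] = self._rng.randrange(self.r)
        return self._h[i]

    def _sign(self, i):
        if i not in self._sigma:
            self._sigma[i] = self._rng.choice((-1, 1))
        return self._sigma[i]

    def update(self, i, v):
        self.b[self._hash(i)] += self._sign(i) * v
```

The full Filter structure keeps s independent copies of this object and feeds every stream update to each of them.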


For each w∈L and i∈[s] with j=hi(w), say the counter Di,j is good for w if for all y∈L\{w}, hi(y)≠j. Since hi is |L|-wise independent when restricted to L with probability at least 1−1/r, Pr[Di,j is good for w]≧(1−1/r)·(1−(|L|−1)/r)≧⅔, where the second inequality holds for ∈≦1. It follows that, since Filter is the concatenation of s independent copies of BasicFilter,










Pr[∀w∈L, ∃i∈[s] for which Di,hi(w) is good for w]≧1−|L|·(1/3)s>9/10.  (EQ. 1)

Let ℰ be the event of EQ. (1).


The following estimator Φ of ∥xhead∥1 is defined given the data in the Filter structure, together with the list L. It is also assumed that ℰ holds, else the estimator is not well-defined. For each w∈L, let i(w) be the smallest i for which Di,hi(w) is good for w, and let j(w)=hi(w)(w). The estimator is then






Φ=Σw∈L sign(xw)·σi(w)(w)·Di(w),j(w),


with σ being a random sign vector, each of whose entries is either +1 or −1. Note that the Filter data structure resembles the CountSketch structure, but with universal hashing replaced by uniform hashing and with a different estimation procedure.


LEMMA 2: E[Φ|ℰ]=∥xhead∥1 and Var[Φ|ℰ]≦2∈2∥x∥12/9.


Proof: By linearity of expectation,







E[Φ|ℰ]=Σw∈L E[sign(xw)·σi(w)(w)·Di(w),j(w)|ℰ].


Fix a w∈L, and for notational convenience let i=i(w) and j=j(w). For each y∈[n], set Γ(y)=1 if hi(y)=j, and set Γ(y)=0 otherwise. Then








Eσi,hi[sign(xw)·σi(w)·Di,j|ℰ]=Σy Eσi,hi[sign(xw)·xy·Γ(y)·σi(y)·σi(w)|ℰ].


Consider any fixing of hi subject to the occurrence of ℰ, and notice that σi is independent of hi. Since σi is 4-wise independent, it follows that












Eσi[sign(xw)·σi(w)·Di,j|hi]=Eσi,hi[sign(xw) xw Γ(w) σi(w) σi(w)]=|xw|,   (EQ. 2)








and hence







E[Φ|ℰ]=Σw∈L|xw|=∥xhead∥1





A bound is now derived for Var[Φ|ℰ]=E[Φ2|ℰ]−E2[Φ|ℰ], or equivalently, the function shown in FIG. 2. First, Σw∈L E[Di(w),j(w)2|ℰ] is bounded. A w∈L is fixed, and for notational convenience, put i=i(w) and j=j(w). Then E[Di,j2|ℰ] is equal to that shown in FIG. 3, where the second equality follows from the fact that σi is 4-wise independent and independent of hi. Note Pr[hi(y)=j|ℰ]=0 for any y∈(L\{w}), and Pr[hi(w)=j|ℰ]=1 by definition.


Now consider a coordinate y∉L. For S⊆[n], let ℰSi be the event that hi is |S|-wise independent when restricted to S. By Bayes' rule, Pr[ℰL∪{y}i|ℰ] is equal to that shown in FIG. 4. Conditioned on ℰL∪{y}i, the value hi(y) is uniformly random even given the images of all members of L under hi. Thus, Pr[hi(y)=j|ℰ]≦10/(9r)+1/r<3/r. Since the bucket is good for w, the total contribution of such y to E[Di,j2|ℰ] is at most 3·∥xtail∥22/r, where xtail is the vector x with the coordinates in L removed. The quantity ∥xtail∥22 is maximized when there are ∈−2 coordinates each of magnitude ∈2∥x∥1, in which case ∥xtail∥22=∈2∥x∥12.


Hence,

E[Di,j2|ℰ]≦xw2+3∈2∥x∥12/r≦xw2+∈4∥x∥12/9


As |L|≦2∈−2, it follows that










Σw∈L E[Di(w),j(w)2|ℰ]≦2∈2∥x∥12/9+Σw∈L xw2







Now turning to bounding











Σw≠y∈L E[sign(xw) sign(xy) σi(w)(w) σi(y)(y)×Di(w),j(w) Di(y),j(y)|ℰ]





Fix distinct w, y∈L. Note that (i(w), j(w))≠(i(y), j(y)) conditioned on ℰ occurring. Suppose first that i(w)≠i(y); then the equality shown in FIG. 5 is obtained, since it holds for any fixed hi(w), hi(y), where the final equality follows from EQ. (2).


Now suppose that i(w)=i(y). Let i=i(w)=i(y) for notational convenience. Define the indicator random variable Γw(z)=1 if hi(z)=j(w), and similarly let Γy(z)=1 if hi(z)=j(y). Then the expression E[sign(xw)sign(xy)σi(w)σi(y)Di(w),j(w)Di(y),j(y)|ℰ] can be expanded using the definitions of Di(w),j(w) and Di(y),j(y) as:










Σz,z′ E[sign(xw) sign(xy) xz xz′ Γw(z) Γy(z′) σi(z) σi(z′)×σi(w) σi(y)|ℰ]




The variables z and z′ are fixed and a summand of the form E[sign(xw)sign(xy)xzxz′Γw(z)Γy(z′)×σi(z)σi(z′)σi(w)σi(y)|ℰ] is analyzed.


Consider any fixing of hi subject to the occurrence of ℰ, and recall that σi is independent of hi. Since σi is 4-wise independent and a sign vector, it follows that this summand vanishes unless {z, z′}={w, y}. Moreover, since Γw(y)=Γy(w)=0 while Γw(w)=Γy(y)=1, it must be that z=w and z′=y. In this case, E[sign(xw)sign(xy)xzxz′Γw(z)Γy(z′)×σi(z)σi(z′)σi(w)σi(y)|hi]=|xw|·|xy|.


Hence, the total contribution of all distinct w, y∈L to Var[Φ|ℰ] is at most Σw≠y∈L|xw|·|xy|.


Combining the bounds, it follows that the relations shown in FIG. 6 hold. This completes the proof of the lemma.


By Chebyshev's inequality, Lemma 2 implies








Pr[|Φ−∥xhead∥1|≧∈∥x∥1 | ℰ]≦Var[Φ|ℰ]/(∈2∥x∥12)≦(2∈2∥x∥12/9)/(∈2∥x∥12)=2/9






and thus








Pr[|Φ−∥xhead∥1|≦∈∥x∥1]≧(7/9)·(9/10)=7/10.





The above findings are summarized with the following theorem:


THEOREM 3: Suppose there is a set L⊆[n] of indices j so that (1) for all j for which |xj|≧∈2∥x∥1, j∈L, and (2) for all j∈L, |xj|≧(∈2/2)∥x∥1. Further, suppose sign(xj) is known for each j∈L. Then, there is a 1-pass algorithm, Filter, which outputs an estimate Φ for which, with probability at least 7/10, |Φ−∥xhead∥1|≦∈∥x∥1. The space complexity of the algorithm is O(∈−2 log(1/∈)log(mM)+log(1/∈)log log n). The update time is O(log(1/∈)), and the reporting time is O(∈−2 log(1/∈)).


The Final Algorithm


The final algorithm for (1±∈)-approximating ∥x∥1, which was outlined above, is now analyzed. The full details of the algorithm are shown in FIG. 7. Before giving the algorithm and analysis, the ℓ1 heavy coordinates problem is defined.


Definition 4: Let 0<γ<φ and δ>0 be given. In the ℓ1 heavy coordinates problem, with probability at least 1−δ a list L⊆[n] is output such that:

    • 1. For all i with |xi|≧φ∥x∥1, i∈L.
    • 2. For all i∈L, |xi|>(φ−γ)∥x∥1.
    • 3. For each i∈L, an estimate x̃i is provided such that |x̃i−xi|<γ∥x∥1.


Note that for γ≦φ/2, the last two items above imply that sign(xi) can be determined for i∈L. For a generic algorithm solving the ℓ1 heavy coordinates problem, HHUpdate(φ), HHReport(φ), and HHSpace(φ) are used to denote its update time, reporting time, and space, respectively, with parameter φ and γ=φ/2, δ=1/20.


There exist several solutions to the ℓ1 heavy coordinates problem in the turnstile model. One line of prior work gives an algorithm with HHSpace(φ)=O(φ−1 log(mM)log(n)), HHUpdate(φ)=O(log(n)), and HHReport(φ)=O(n log(n)); another gives an algorithm with HHSpace(φ)=O(φ−1 log(φn) log log(φn) log(1/φ) log(mM)), HHUpdate(φ)=O(log(φn) log log(n) log(1/φ)), and HHReport(φ)=O(φ−1 log(φn) log log(φn) log(1/φ)).
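For concreteness, a minimal Count-Min sketch (the structure named in claim 7 below for identifying heavy coordinates) can be sketched as follows. This is an insert-only illustration with truly random hashing; the turnstile-model solutions cited above are more involved.

```python
import random

class CountMin:
    """Minimal Count-Min sketch, one standard way to identify l1-heavy
    coordinates. Illustrative only: insert-only (nonnegative) updates and
    simple xor-based hashing stand in for the cited constructions."""
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.tables = [[0] * width for _ in range(depth)]
        self.masks = [rng.randrange(1 << 30) for _ in range(depth)]
        self.width = width

    def _bucket(self, d, i):
        return (self.masks[d] ^ hash(i)) % self.width

    def update(self, i, v):
        for d, table in enumerate(self.tables):
            table[self._bucket(d, i)] += v

    def estimate(self, i):
        # For nonnegative updates, the row minimum never underestimates
        # the true frequency of item i.
        return min(t[self._bucket(d, i)] for d, t in enumerate(self.tables))
```

A coordinate is reported heavy when its estimate exceeds the chosen threshold; the min estimator overestimates by at most the colliding mass in its lightest row.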


Also, the following theorem follows from Lemma 2.2 (with k=3 in their notation). In Theorem 5 (and in FIG. 7), the Cauchy distribution is a continuous probability distribution defined by its density function μ(x)=(π(1+x2))−1. One can generate a Cauchy random variable X by setting X=tan(πU/2) for U a random variable uniform in [0, 1]. Of course, to actually implement our algorithm (or that of Theorem 5) one can only afford to store these random variables to some finite precision; this is discussed in Remark 9 below.
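The tangent recipe can be sketched in a line of Python. The form below, an assumed variant shifted by 1/2, inverts the Cauchy CDF F(x)=1/2+arctan(x)/π directly so that both signs are produced; evaluating tan(πU/2) for U uniform in [0, 1] yields the variate's magnitude only.

```python
import math

def cauchy_from_uniform(u):
    """Map u uniform in (0, 1) to a standard Cauchy variate by inverting
    the CDF F(x) = 1/2 + arctan(x)/pi; a shifted form of the tangent
    recipe quoted in the text."""
    return math.tan(math.pi * (u - 0.5))
```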


THEOREM 5: For an integer n>0, let A1[j], . . . , An[j] be 3n independent Cauchy random variables for j=1, 2, 3. Let x∈Rn be arbitrary. Then given








Cj=Σi=1n Ai[j]·xi for j=1, 2, 3,





the estimator







EstGM=EstGM(C1, C2, C3)=(3√3/8)·(|C1|·|C2|·|C3|)1/3








satisfies the following two properties:







1. E[EstGM]=∥x∥1
2. Var[EstGM]=(19/8)·∥x∥12






It is shown in Theorem 6 that the algorithm outputs (1±O(∈))∥x∥1 with probability at least 3/5. Note this error term can be made ∈ by running the algorithm with ∈′ being ∈ times a sufficiently small constant. Also, the success probability can be boosted to 1−δ by running O(log(1/δ)) instantiations of the algorithm in parallel and returning the median output across all instantiations.
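Theorem 5's estimator and the median-boosting step just described can be sketched as follows; the unbiasing constant 3√3/8 is reconstructed from the stated mean and variance, and the function names are illustrative.

```python
import math

def est_gm(c1, c2, c3):
    """Geometric-mean estimator of Theorem 5: with each C_j a
    Cauchy-weighted sum of the coordinates of x, the scaled geometric
    mean of |C_1|, |C_2|, |C_3| is an unbiased estimate of ||x||_1.
    The constant 3*sqrt(3)/8 is reconstructed from the stated variance."""
    return (3.0 * math.sqrt(3.0) / 8.0) * abs(c1 * c2 * c3) ** (1.0 / 3.0)

def median_boost(estimates):
    """Median across independent instantiations, the boosting trick
    described above for driving the failure probability down to delta."""
    s = sorted(estimates)
    return s[len(s) // 2]
```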


THEOREM 6: The algorithm of FIG. 7 outputs (1±O(∈))∥x∥1 with probability at least 3/5.


PROOF: Throughout this proof, A is used to denote the 3n-tuple (A1[1], . . . , An[1], . . . , A1[3], . . . , An[3]), and for S⊆[n], ℰS is the event that the hash family that is randomly selected in Step 3 via Theorem 1 is |S|-wise independent when restricted to S. For an event ℰ, 1ℰ denotes the indicator random variable for ℰ. The variable xhead is used to denote x projected onto the coordinates in L, and xtail is used to denote the remaining coordinates. Note ∥x∥1=∥xhead∥1+∥xtail∥1.


The following lemma will now be proved. The proof requires some care since h is not always a uniform hash function on small sets, but is only so on any particular (small) set with large probability.


LEMMA 7: Conditioned on the randomness of HH of FIG. 7,








EA,h[(R/|I|)·Σj∈I L̃1(j)]=(1±O(∈))∥xtail∥1.






PROOF: For ρ=1−Pr[ℰL], see FIG. 8, by Theorem 1 and Theorem 5.


The above expectation is now computed conditioned on I. Let ℰI′ be the event I=I′ for an arbitrary I′. Then, see FIG. 9. Now, see FIG. 10. It should be noted that if ℰL∪{i} occurs, then ℰI′ is independent of the event h(i)=j. Also, if ℰL occurs, then ℰI′ is independent of ℰL∪{i}. Thus, the above equals









Prh[h(i)=j | ℰL∪{i}]·Prh[ℰI′ | ℰL]·Pr[ℰL∪{i} | ℰL]+Pr[¬ℰL∪{i} | ℰL]·Pr[ℰI′ | ℰL]×Pr[h(i)=j | ¬ℰL∪{i}, ℰL, ℰI′].






Note Pr[¬ℰL∪{i}|ℰL]≦Pr[¬ℰL∪{i}]/Pr[ℰL]=ρ′i/(1−ρ) for ρ′i=1−Pr[ℰL∪{i}]. Also, Pr[ℰL∪{i}|ℰL]≧Pr[ℰL∪{i}], since Pr[ℰL∪{i}] is a weighted average of Pr[ℰL∪{i}|ℰL] and Pr[ℰL∪{i}|¬ℰL], and the latter is 0. Thus, for some ρ″i∈[0, ρ′i], EQ. (4) is








(R/|I|)·Σj∈I Σi∉L |xi|·((1−ρ″i)/R±ρ′i/(1−ρ))=∥xtail∥1−Σi∉L ρ″i|xi|±(maxi ρ′i/(1−ρ))·R·∥xtail∥1.








By the setting of c=2 when picking the hash family of Theorem 1 in Step 3, ρ, ρ′i, ρ″i=O(∈3) for all i, and thus (maxi ρ′i/(1−ρ))·R=O(∈), implying the above is (1±O(∈))∥xtail∥1. Plugging this into EQ. 3 then shows that the desired expectation is (1±O(∈))∥xtail∥1.


The expected variance of (R/|I|)·Σj∈I L̃1(j) is now bounded.


LEMMA 8: Conditioned on HH being correct,








Eh[VarA[(R/|I|)·Σj∈I L̃1(j)]]=O(∈2·∥x∥12).





PROOF: For any fixed h, R/|I| is determined and the {tilde over (L)}1(j) are pairwise independent. Thus for fixed h,








VarA[(R/|I|)·Σj∈I L̃1(j)]=(R/|I|)2·Σj∈I VarA[L̃1(j)].







First observe that since |I|≧R−|L|≧2/∈2, for any choice of h, R/|I|≦2. Thus, up to a constant factor, the expectation to be computed is








Eh[VarA[Σj∈I L̃1(j)]].





For notational convenience, set L̃1(j)=0 if j∉I. Now see FIG. 11. Now consider the quantity Prh[h(i)=j|j∈I]. Then Prh[h(i)=j|j∈I] is equal to that shown in FIG. 12. Then by Bayes' theorem, what is shown in FIG. 12 is at most that which is shown in FIG. 13. Note that |L|/R≦1/2. Also, by the choice of c in the application of Theorem 1 in Step 3, Pr[ℰL]=1−O(∈) and Pr[¬ℰL∪{i}]=O(1/R2). Thus, overall, Prh[h(i)=j|j∈I]=O(1/R).


An essentially identical calculation, but conditioning on ℰL∪{i,i′} instead of ℰL∪{i}, gives that Prh[(h(i)=j)∧(h(i′)=j)|j∈I]=O(1/R2). Combining these bounds with EQ. 5, the expected variance to be computed is O(∥xtail∥22+∥xtail∥12/R).


The second summand is O(∈2∥x∥12). For the first summand, conditioned on HH being correct, every |xi| for i∉L has |xi|≦∈2∥x∥1. Under this constraint, ∥xtail∥22 is maximized when there are exactly 1/∈2 coordinates i∉L each with |xi|=∈2∥x∥1, in which case ∥xtail∥22=∈2∥x∥12.
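The extremal case in the last step can be checked numerically; the values chosen below are illustrative.

```python
# Numerical check of the extremal case above: with HH correct, every tail
# coordinate is at most eps^2 * ||x||_1 in magnitude, and ||x_tail||_2^2
# is maximized by 1/eps^2 coordinates of exactly that magnitude.
eps, x_l1 = 0.1, 1.0
k = int(round(1 / eps ** 2))         # 100 coordinates
tail = [eps ** 2 * x_l1] * k         # each of magnitude eps^2 * ||x||_1
l2_sq = sum(v * v for v in tail)     # equals eps^2 * ||x||_1^2
```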


The proof of correctness of the full algorithm shown in FIG. 7 will now be completed as follows. Conditioning is done on the event ℰHH that HH succeeds, i.e., satisfies the three conditions of Definition 4. Given this, conditioning is done on the event ℰF that F succeeds as defined by Theorem 3, i.e., that Φ=∥xhead∥1±∈∥x∥1.


Next, the quantity

X=(R/|I|)·Σj∈I L̃1(j)

is examined.


By Lemma 7, E[X], even conditioned on the randomness used by HH to determine L, is (1±O(∈))∥xtail∥1. Also, conditioned on ℰHH, the expected value of Var[X] for a random h is O(∈2∥x∥12). Since Var[X] is always non-negative, Markov's bound applies and Var[X]=O(∈2∥x∥12) with probability at least 19/20 (over the randomness in selecting h).


Thus, by Chebyshev's inequality,












PrA,h[|X−E[X]|>t·∈∥x∥1 | ℰHH]<1/20+O(1/t2),   (EQ. 6)








which can be made at most 1/15 by setting t to a sufficiently large constant. Call the event in EQ. 6 ℰ. Then, as long as ℰHH and ℰF occur and ℰ does not, the final estimate of ∥x∥1 is (1±O(∈))∥xtail∥1+∥xhead∥1±O(∈∥x∥1)=(1±O(∈))∥x∥1, as desired. The probability of correctness is then at least that shown in FIG. 14.


Remark 9: It is known from previous work that each Ai[j] can be maintained up to only O(log(n/∈)) bits of precision, and requires the same amount of randomness to generate, in order to preserve the probability of correctness to within an arbitrarily small constant. Then, note that the counters Bi[j] each only consume O(log(nmM/∈)) bits of storage.


Given Remark 9, the following theorem is given.


Theorem 10: Ignoring the space to store the Ai[j], the overall space required for the algorithm of FIG. 7 is O((∈−2 log(nmM/∈)+log log(n))log(1/∈))+HHSpace(∈2). The update and reporting times are, respectively, O(log(1/∈))+HHUpdate(∈2) and O(∈−2 log(1/∈))+HHReport(∈2).


PROOF: Ignoring F and HH, the update time is O(1) to compute h and O(1) to update the corresponding Bh(i). Also ignoring F and HH, the space required is O(∈−2 log(nmM/∈)) bits to store all the Bi[j] (Remark 9), and O(∈−2 log(1/∈)+log log(n)) bits to store h and randomly select the hash family it comes from (Theorem 1). The time to compute the final line in the estimator, given L and ignoring the time to compute Φ, is O(1/∈2). The bounds stated above then take into account the complexities of F and HH.


Derandomizing the Final Algorithm


Observe that a naive implementation of storing the entire tuple A in FIG. 7 requires Ω(n log(n/∈)) bits. Considering that one goal is to have a small-space algorithm, this is clearly not affordable. As it turns out, using a now standard technique in streaming algorithms, one can avoid storing the tuple A explicitly. This is accomplished by generating A from a short, truly random seed which is then stretched out by a pseudorandom generator against space-bounded computation. In Indyk's original argument, he used Nisan's PRG to show that his entire algorithm was fooled by using the PRG to stretch a short seed of length O(∈−2 log(n/∈) log(nmM/∈)) to generate Θ(n/∈2) Cauchy random variables. However, for fooling this algorithm, this derandomization step used Ω(1/∈2) time during each stream update to generate the necessary Cauchy random variables from the seed. Given that another goal of one or more embodiments is to have fast update time, this is not desired. Therefore, to derandomize the final algorithm discussed above, Nisan's PRG can be applied in such a way that the time to apply the PRG to the seed to retrieve any Ai[j] is small.


First, recall the definition of a finite state machine (FSM). An FSM M is parameterized by a tuple (Tinit, S, Γ, n). The FSM M is always in some “state”, which is just a string T∈{0, 1}S, and it starts in the state Tinit. The parameter Γ is a function mapping {0, 1}S×{0, 1}n→{0, 1}S. Notation is abused, and for x∈({0, 1}n)r for r a positive integer, Γ(T, x) is used to denote Γ( . . . (Γ(Γ(T, x1), x2), . . . ), xr). Note that given a distribution D over ({0, 1}n)r, there is an implied distribution M(D) over {0, 1}S obtained as Γ(Tinit, D).
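The folding convention Γ(T, x) can be mirrored in a few lines; states and blocks are arbitrary Python values here rather than bit strings, purely for illustration.

```python
def run_fsm(t_init, gamma, blocks):
    """Fold the transition function Gamma over a sequence of input blocks,
    matching the abused notation Gamma(T, x) = Gamma(...Gamma(Gamma(T, x1),
    x2), ..., xr). States and blocks are arbitrary values here, not
    elements of {0,1}^S and {0,1}^n."""
    state = t_init
    for block in blocks:
        state = gamma(state, block)
    return state
```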


DEFINITION 11: Let t be a positive integer. For D, D′ two distributions on {0,1}t, the total variation distance Δ(D, D′) is defined by







Δ(D, D′)=maxT⊆{0,1}t |PrX∼D[X∈T]−PrY∼D′[Y∈T]|.






THEOREM 12. Let Ut denote the uniform distribution on {0, 1}t. For any positive integers r, n, and for some S=Θ(n), there exists a function Gnisan: {0,1}s→({0, 1}n)r with s=O(S log(r)) such that for any FSM M=(Tinit, S, Γ, n), Δ(M((Un)r), M(Gnisan(Us)))≦2−S.


Furthermore, for any x∈{0, 1}s and i∈[r], computing the n-bit block Gnisan(x)i requires O(S log(r)) space and O(log(r)) arithmetic operations on O(S)-bit words.


Before finally describing how Theorem 12 fits into a de-randomization of FIG. 7, the following standard lemma is stated.


LEMMA 13: If X1, . . . , Xm are independent and Y1, . . . , Ym are independent, then







Δ(X1× . . . ×Xm, Y1× . . . ×Ym)≦Σi=1m Δ(Xi, Yi).






Now, the derandomization of FIG. 7 is as follows. Condition on all the randomness in FIG. 7 except for A. Recall that there are R=Θ(1/∈2) “buckets” Bu. Each bucket contains three counters, each of which is a sum of at most n Cauchy random variables, each weighted by at most mM. Given the precision required to store A (Remark 9), the three counters in Bu in total consume S′=O(log(nmM/∈)) bits of space. Consider the FSM Mu which has 2S states for S=S′+log(n), representing the state of the three counters together with an index icur∈[n] that starts at 0. Define t as the number of uniform random bits required to generate each Ai[j], so that t=O(log(nmM/∈)) by Remark 9. Note t=Θ(S). Consider the transition function Γ: {0, 1}S×{0, 1}3t→{0, 1}S defined as follows: upon being fed (Ai[1], Ai[2], Ai[3]) (or more precisely, the 3t uniform random bits used to generate this tuple), increment icur, then add Ai[j]·xi to each Bu[j], for i being the (icur)th index i∈[n] such that h(i)=u. Now, note that if one feeds the (Ai[1], Ai[2], Ai[3]) for which h(i)=u to Mu, sorted by i, then the state of Mu corresponds exactly to the state of bucket Bu in the algorithm.


By Theorem 12, if rather than defining A by 3tr truly random bits (for r=n) it is defined instead by stretching a seed of length s=O(S log(n))=O(log(nmM/∈) log(n)) via Gnisan, then the distribution of the state of Bu at the end of the stream changes by at most a total variation distance of 2−S. Now, suppose R independent seeds are used to generate different A vectors in each of the R buckets. Note that since each index i∈[n] is hashed to exactly one bucket, the Ai[j] across each bucket need not be consistent to preserve the behavior of the algorithm. Then for Us being the uniform distribution on {0, 1}s,

Δ(M1((U3t)r)× . . . ×MR((U3t)r), M1(Gnisan(Us))× . . . ×MR(Gnisan(Us)))≦R·2−S

by Lemma 13.


By increasing S by a constant factor, R·2−S can be ensured to be an arbitrarily small constant δ. Now, note that the product measure on the output distributions of the Mu corresponds exactly to the state of the entire algorithm at the end of the stream. Thus, if one considers T to be the set of states (B1, . . . , BR) for which the algorithm outputs a value (1±∈)∥x∥1 (i.e., is correct), by definition of total variation distance (Definition 11), the probability of correctness of the algorithm changes by at most an additive δ when using Nisan's PRG instead of uniform randomness. Noting that storing R independent seeds takes just Rs bits of space, and that extracting any Ai[j] from a seed requires O(log(n)) time by Theorem 12, the following theorem is obtained.
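The "short seed per bucket, regenerate on demand" idea can be illustrated with an ordinary seeded PRG in place of Nisan's generator. The sketch below shows only the reproducibility property (any single Ai[j] can be recomputed without storing A), not the space-bounded-fooling guarantee; the function and keying scheme are invented for illustration.

```python
import math
import random

def a_value(bucket_seed, i, j):
    """Hypothetical stand-in for recomputing A_i[j] on demand from a short
    per-bucket seed: a seeded PRG keyed by (seed, i, j) plays the role that
    Nisan's generator plays in the text, so A never needs to be stored
    explicitly."""
    rng = random.Random(f"{bucket_seed}:{i}:{j}")
    # Standard Cauchy via CDF inversion (shifted tangent recipe).
    return math.tan(math.pi * (rng.random() - 0.5))
```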


THEOREM 14: Including the space and time complexities of storing and accessing the Ai[j], the algorithm of FIG. 7 can be implemented with an additive O(∈−2 log(nmM/∈) log(n)) increase to the space, additive O(log(n)) increase to the update time, and no change to the reporting time, compared with the bounds given in Theorem 10.


Therefore, as can be seen from the above discussion, one or more embodiments provide a 1-pass algorithm using ∈−2 polylog(nmM) space for ℓ1-estimation in data streams, with polylog(nmM) update time and ∈−2 polylog(nmM) reporting time. This algorithm is the first to be simultaneously optimal in both space and update time up to polylog(nmM) factors. Conventional algorithms either required at least ∈−3 polylog(nmM) bits of space, or at least ∈−2 update time. As ∈ can be arbitrarily small, the result of one or more embodiments can provide a substantial benefit over conventional algorithms. In light of known lower bounds, the space and time complexity of these one or more embodiments are optimal up to polylog(nmM) factors.


Operational Flow



FIG. 15 is an operational flow diagram illustrating one example of measuring the distance between two or more vectors. The operational flow diagram of FIG. 15 begins at step 1502 and flows directly to step 1504. The data stream analyzer 110, at step 1504, analyzes at least two vectors of n coordinates. The data stream analyzer 110, at step 1506, identifies a set of heavy coordinates from the set of n coordinates associated with the at least two vectors. The data stream analyzer 110, at step 1508, identifies a set of light coordinates from the set of n coordinates. The data stream analyzer 110, at step 1510, determines a first estimate of a contribution from the set of heavy coordinates to the ℓ1 distance between the at least two vectors. The data stream analyzer 110, at step 1512, determines a second estimate of a contribution from the set of light coordinates to the ℓ1 distance between the at least two vectors. The data stream analyzer 110, at step 1514, sums the first estimate and the second estimate. The control flow then exits at step 1516.
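The flow of FIG. 15 has a simple offline analogue, useful as a correctness oracle when testing a streaming implementation. The function below is illustrative: it splits the coordinates of the difference vector by the ∈2∥x∥1 threshold and computes each contribution exactly rather than estimating it in small space.

```python
def l1_by_split(x, eps):
    """Offline analogue of FIG. 15: split the coordinates of the
    (difference) vector into heavy (|x_i| >= eps^2 * ||x||_1) and light,
    take each contribution exactly, and sum the two. A streaming
    implementation approximates each piece instead."""
    l1 = sum(abs(v) for v in x)
    thresh = eps ** 2 * l1
    heavy = sum(abs(v) for v in x if abs(v) >= thresh)
    light = sum(abs(v) for v in x if abs(v) < thresh)
    return heavy + light
```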


Non-Limiting Examples

Aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system. Also, aspects of the present invention have been discussed above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.


These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.


The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.


Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. A computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.


A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.


Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.


Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).


The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.


The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiments above were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims
• 1. An information processing system for determining a distance between at least two vectors of n coordinates, the information processing system comprising: a memory; a processor communicatively coupled to the memory; and a data stream analyzer communicatively coupled to the memory and the processor, the data stream analyzer being configured to perform a method comprising: identifying a set of coordinates from a set of n coordinates associated with at least two vectors as a set of heavy coordinates, wherein a heavy coordinate is represented as |xi|≧∈2∥x∥1, where x is a vector, i is a coordinate in the set of n coordinates, and ∈ is an arbitrary number; identifying a set of coordinates from the set of n coordinates associated with the at least two vectors as a set of light coordinates, wherein a light coordinate is represented as |xi|<∈2∥x∥1; determining a first estimation of a contribution from the set of heavy coordinates to a rectilinear distance between the at least two vectors, wherein determining the first estimation comprises multiplying each of a set of stream updates by a subset of a set of constructed hash functions, wherein the subset is defined based on ∈2; determining a second estimation of a contribution from the set of light coordinates to the rectilinear distance, wherein determining the second estimation comprises multiplying each of a set of stream updates by a subset of the set of constructed hash functions, wherein the subset is defined based on ∈2, and wherein the second estimation is determined separate from the first estimation; and combining the first estimation with the second estimation.
• 2. The information processing system of claim 1, wherein determining the first estimation comprises: maintaining a first data structure by: selecting a random sign vector σ∈{−1, 1}n from a 4-wise independent family; setting r=[27/∈2]; selecting a hash function h: [n]→[r] from a family constructed randomly; and initializing r counters b1, . . . , br; receiving an update in the form of (i, v), where v is a change to i; and adding σ(i)·v to bh(i).
• 3. The information processing system of claim 2, wherein determining the first estimation further comprises: maintaining a second data structure by: initializing s=[log3(1/∈2)]+3 independent copies of the first data structure; and, given the update (i, v), adding σ(i)·v to bh(i) in each of the s copies of the first data structure.
• 4. The information processing system of claim 2, wherein a space complexity of the second data structure is O(∈−2 log(1/∈)log(mM)+log(1/∈)log log n), where O represents a constant that is independent of n, m is a number of updates from a set [n]×{−M, . . . , M}, and where an update time of the second data structure is O(log(1/∈)).
  • 5. The information processing system of claim 3, wherein the first estimation is equal to
• 6. The information processing system of claim 3, wherein determining the second estimation comprises: maintaining R=Θ(1/∈2) buckets Bi in parallel with the second data structure; mapping each i in [n] to exactly one bucket in [R], wherein the ith bucket keeps track of a dot product of x, restricted to those indices hashed to i, with three random Cauchy vectors; calculating a geometric mean for each bucket corresponding to the set of light coordinates; and summing the geometric means calculated for those buckets.
  • 7. The information processing system of claim 6, wherein the heavy coordinates are identified using a CountMin sketch algorithm, and wherein the set of light coordinates are identified as a set of buckets from the R=Θ(1/∈2) buckets failing to comprise any heavy coordinates.
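Claim 7 identifies the heavy coordinates with a CountMin sketch. A minimal sketch of that step, with plain random maps standing in for the pairwise-independent hash family of the published CountMin construction (names here are illustrative):

```python
import random

def countmin_heavy(updates, n, width, depth, threshold, seed=0):
    """Flag coordinates whose CountMin point-query estimate reaches threshold."""
    rng = random.Random(seed)
    hashes = [[rng.randrange(width) for _ in range(n)] for _ in range(depth)]
    table = [[0] * width for _ in range(depth)]
    for i, v in updates:
        for d in range(depth):
            table[d][hashes[d][i]] += v
    # Point query: min over rows, an overestimate for non-negative streams.
    def estimate(i):
        return min(table[d][hashes[d][i]] for d in range(depth))
    return [i for i in range(n) if estimate(i) >= threshold]
```

Buckets from the R=Θ(1/∈2) buckets that contain none of the flagged coordinates are then treated as the light buckets.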
  • 8. A computer program product for determining a distance between at least two vectors of n coordinates comprising: a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code configured to perform a method comprising: identifying a set of coordinates from a set of n coordinates associated with at least two vectors as a set of heavy coordinates, wherein a heavy coordinate is represented as |xi|≧∈2∥x∥1, where x is a vector, i is a coordinate in the set of n coordinates, and ∈ is an arbitrary number; identifying a set of coordinates from the set of n coordinates associated with the at least two vectors as a set of light coordinates, wherein a light coordinate is represented as |xi|<∈2∥x∥1; determining a first estimation of a contribution from the set of heavy coordinates to a rectilinear distance between the at least two vectors, wherein determining the first estimation comprises multiplying each of a set of stream updates by a subset of a set of constructed hash functions, wherein the subset is defined based on ∈2; determining a second estimation of a contribution from the set of light coordinates to the rectilinear distance, wherein determining the second estimation comprises multiplying each of a set of stream updates by a subset of the set of constructed hash functions, wherein the subset is defined based on ∈2, and wherein the second estimation is determined separate from the first estimation; and combining the first estimation with the second estimation.
  • 9. The computer program product of claim 8, wherein determining the first estimation comprises: maintaining a first data structure by: selecting a random sign vector σ∈{−1, 1}n from a 4-wise independent family; setting r=[27/∈2]; selecting a hash function h: [n]→[r] from a family constructed randomly; and initializing r counters b1, . . . , br; receiving an update in the form of (i, v), where v is a change to coordinate i; and adding σ(i)·v to bh(i).
  • 10. The computer program product of claim 9, wherein determining the first estimation further comprises: maintaining a second data structure by: initializing s=[log3(1/∈2)]+3 independent copies of the first data structure; and, given the update (i, v), adding σ(i)·v to bh(i) in each of the s copies of the first data structure.
  • 11. The computer program product of claim 9, wherein a space complexity of the second data structure is O(∈−2 log(1/∈)log(mM)+log(1/∈)log log n), where O represents a constant that is independent of n, m is a number of updates from a set [n]×{−M, . . . , M}, and where an update time of the second data structure is O(log(1/∈)).
  • 12. The computer program product of claim 10, wherein determining the second estimation comprises: maintaining R=Θ(1/∈2) buckets Bi in parallel with the second data structure; mapping each i in [n] to exactly one bucket in [R], wherein the ith bucket keeps track of a dot product of x, restricted to those indices hashed to i, with three random Cauchy vectors; calculating a geometric mean for each bucket corresponding to the set of light coordinates; and summing the geometric means calculated for each bucket.
  • 13. The computer program product of claim 12, wherein the heavy coordinates are identified using a CountMin sketch algorithm, and wherein the set of light coordinates are identified as a set of buckets from the R=Θ(1/∈2) buckets failing to comprise any heavy coordinates.
US Referenced Citations (15)
Number Name Date Kind
6152563 Hutchinson et al. Nov 2000 A
6173415 Litwin et al. Jan 2001 B1
7158961 Charikar Jan 2007 B1
7437385 Duffield et al. Oct 2008 B1
7590657 Cormode et al. Sep 2009 B1
7751325 Krishnamurthy et al. Jul 2010 B2
7756805 Cormode et al. Jul 2010 B2
7779143 Bu et al. Aug 2010 B2
20050111555 Seo May 2005 A1
20080225740 Martin et al. Sep 2008 A1
20090031175 Aggarwal et al. Jan 2009 A1
20090046581 Eswaran et al. Feb 2009 A1
20090172059 Cormode et al. Jul 2009 A1
20090292726 Cormode et al. Nov 2009 A1
20100049700 Dimitropoulos et al. Feb 2010 A1
Non-Patent Literature Citations (42)
Entry
Aggarwal, C., et al., “On the Surprising Behavior of Distance Metrics in High Dimensional Spaces,” In ICDT, pp. 420-434, 2001.
Beyer, K.S., et al., "Bottom-Up Computation of Sparse and Iceberg Cubes," In SIGMOD Conference, pp. 359-370, 1999.
Candès, E.J., et al., "Stable Signal Recovery From Incomplete and Inaccurate Measurements," Communications on Pure and Applied Mathematics, 59(8), 2006.
Chaudhuri, S., et al., “On Random Sampling Over Joins,” in SIGMOD Conference, pp. 263-274, 1999.
Cisco NetFlow. http://www.cisco.com/go/netflow.
Clarkson, K.L., "Subgradient and Sampling Algorithms for L1 Regression," In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.
Cohen, E., et al., “Algorithms and Estimators for Accurate Summarization of Internet Traffic,” In Internet Measurement Conference, pp. 265-278, 2007.
Cormode, G., et al., “Comparing Data Streams Using Hamming Norms (how to zero in),” IEEE Trans. Knowl. Data Eng., 15(3): 529-540, 2003.
Cormode, G., et al., “Sketching Streams Through the Net: Distributed Approximate Query Tracking,” In VLDB, pp. 13-24, 2005.
Cormode, G., et al., “Space and Time Efficient Deterministic Algorithms for Biased Quantiles Over Data Streams,” In PODS, pp. 263-272, 2006.
Cormode, G., et al., “Time Decaying Aggregates in Out-of-Order Streams,” In PODS, pp. 89-98, 2008.
Cormode, G., et al., “An Improved Data Stream Summary: The Countmin Sketch and Its Applications,” J. Algorithms, 55(1):58-75, 2005.
Cormode, G., et al., “Space Efficient Mining of Multigraph Streams,” In Proceedings of the 24th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pp. 271-282, 2005.
Cormode, G., et al., “What's Hot and What's Not: Tracking Most Frequent Items Dynamically,” ACM Trans. Database Syst., 30(1):249-278, 2005.
Dodge, Y., “L1-Statistical Procedures and Related Topics,” Institute for Mathematical Statistics, 1997.
Fang, M., et al., “Computing Iceberg Queries Efficiently,” In VLDB, pp. 299-310, 1998.
Feigenbaum, J., et al., "An Approximate L1 Difference Algorithm for Massive Data Streams," SIAM J. Comput., 32(1): 131-151, 2006.
Gilbert, A.C., et al., “One Sketch for all: Fast Algorithms for Compressed Sensing,” In STOC, pp. 237-246, 2007.
Indyk, P., “Stable Distributions, Pseudo-Random Generators, Embeddings, and Data Stream Computation,” J. ACM, 53(3): 307-323, 2006.
Jayram, T.S., et al., “The Data Stream Space Complexity of Cascaded Norms,” In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2009.
Kane, D.M., et al., “On the Exact Space Complexity of Sketching Small Norms,” In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), to appear, 2010.
Labib, K., et al., “A Hardware Based Clustering Approach for Anomaly Detection,” 2006.
Lau, W.C., et al., “Datalite: A Distributed Architecture for Traffic Analysis Via Lightweight Traffic Digest,” In BROADNETS, pp. 622-630, 2007.
Lopuhaa, H.P., et al., "Breakdown Points of Affine Equivariant Estimators of Multivariate Location and Covariance Matrices," Annals of Statistics, 19(1):229-248, 1991.
Nelson, J., et al., “A Near Optimal Algorithm for L1 Difference,” CoRR, abs/0904.2027, 2009.
Nie, J., et al., "Semi-Definite Representation of the k-Ellipse," Algorithms in Algebraic Geometry, IMA Volumes in Mathematics and its Applications, 146:117-132, 2008.
Nisan, N., “Pseudorandom Generators for Space Bounded Computation,” Combinatorica, 12(4):449-461, 1992.
"Open Problems in Data Streams and Related Topics", IITK Workshop on Algorithms for Data Streams, 2006. http://www.cse.iitk.ac.in/users/sganguly/data-stream-probs.pdf.
Pagh, A., et al., “Uniform Hashing in Constant Time and Linear Space,” SIAM J. Comput., 38(1):85-96, 2008.
Schweller, R.T., "Reversible Sketches: Enabling Monitoring and Analysis Over High Speed Data Streams," IEEE/ACM Trans. Netw., 15(5): 1059-1072, 2007.
Thorup, M., et al., "Tabulation Based 4-Universal Hashing with Applications to Second Moment Estimation," In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 615-624, 2004.
Vadhan, S.P., Pseudorandomness II. Manuscript. http://people.seas.harvard.edu/salil/cs225/spring09/lecnotes/FnTTCS-vol2.pdf.
Yi, B., et al., "Fast Time Sequence Indexing for Arbitrary Lp Norms," In VLDB, pp. 385-394, 2000.
Indyk, P., et al., “Declaring Independence Via the Sketching of Sketches,” In SODA pp. 737-745, 2008.
Berinde, R., et al., “Space-Optimal Heavy Hitters with Strong Error Bounds,” PODS'09, Jun. 29-Jul. 2, 2009, Providence, Rhode Island; Copyright 2009, ACM 978-1-60558-553-6/09/06.
Cormode, G., et al., “Finding Hierarchical Heavy Hitters in Streaming Data,” Copyright 2007, ACM 0362-5915/2007/0300-0001.
Jayram, T.S., et al., "The Data Stream Space Complexity of Cascaded Norms," 2009 50th Annual IEEE Symposium on Foundations of Computer Science, 0272-5428/09; copyright 2009 IEEE; DOI 10.1109/FOCS.2009.82.
Zhu, Z., et al., “Finding Heavy Hitters by Packet Count Flow Sampling,” 2008 International Conference on Computer and Electrical Engineering, 978-0-7695-3504-3/08 copyright 2008 IEEE; DOI 10.1109/ICCEE.2008.90.
Lahiri, B., et al., “Finding Correlated Heavy-Hitters Over Data Streams,” 978-1-4244-5736-6/09 copyright 2009 IEEE.
Zhang, N., et al., “Identifying Heavy-Hitter Flows Fast and Accurately,” copyright 2010 IEEE, 978-1-4244-5824-0.
Woodruff, D.P., et al., “Fast Manhattan Sketches in Data Streams,” to be copyrighted Jun. 2011.
Non-Final Rejection for U.S. Appl. No. 13/563,864 dated Nov. 23, 2012.
Related Publications (1)
Number Date Country
20120215803 A1 Aug 2012 US