The present invention generally relates to methods for summarizing data, and more particularly to methods for producing summaries of unaggregated data appearing, e.g., in massive data streams, for use in subsequent analysis of the data.
It is often useful to provide a summary of a high volume stream of unaggregated weighted items that arrive faster and in larger quantities than can be saved, so that only a sample can be stored efficiently. Preferably, we would like to provide a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets of the data.
Many data sets occur as unaggregated data sets, where multiple data points are associated with each key. The weight of a key is the sum of the weights of the data points associated with that key, and the aggregated view of the data, over which aggregates of interest are defined, comprises the set of keys and the weight associated with each key.
In greater detail, this invention is concerned with the problem of summarizing a population of data points, each of which takes the form (k, x), where k is a key and x≧0 is called a weight. Generally, in an unaggregated data set, a given key occurs in multiple data points of the population. An aggregate view of the population is provided by the set of key weights: the weight of a given key is simply the sum of the weights of data points with that key within the population. This aggregate view would support queries that require selection of sub-populations with arbitrary key predicates.
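By way of illustration only, the following sketch (in Python) makes the aggregate view and a subpopulation query concrete. The function names aggregate and subpopulation_weight and the toy packet data are hypothetical and chosen only for this illustration; as discussed below, directly computing this aggregate view is often infeasible in the applications of interest.

```python
from collections import defaultdict

def aggregate(data_points):
    """Collapse unaggregated (key, weight) data points into the aggregate view:
    one total weight per key."""
    totals = defaultdict(float)
    for key, x in data_points:
        totals[key] += x
    return dict(totals)

def subpopulation_weight(aggregate_view, predicate):
    """Total weight of the keys selected by an arbitrary key predicate."""
    return sum(w for key, w in aggregate_view.items() if predicate(key))

# Example: packets keyed by (source address, destination address), weighted by byte size.
packets = [(("10.0.0.1", "10.0.0.9"), 1500),
           (("10.0.0.2", "10.0.0.9"), 40),
           (("10.0.0.1", "10.0.0.9"), 1500)]
view = aggregate(packets)
print(subpopulation_weight(view, lambda key: key[0] == "10.0.0.1"))  # prints 3000.0
```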
However, in many application scenarios, it is not feasible to compute the aggregate key weights directly; we describe some of these application scenarios below. In these applications time and processing constraints prohibit direct queries and make it necessary to first compute a summary of the aggregate view over the data points, and then to process the query on the summary. A crucial requirement of such summaries is that they must also support selection of subpopulations with arbitrary key predicates. Since the keys of interest are not assumed to be known at the time of summarization, the summarization process must retain per-key statistical estimates of the aggregate weights.
Turning now to applications of interest, communications networking provides a fertile area for developing summarization methods. In the Internet Protocol (IP) suite, routers forward packets between high speed interfaces based on the contents of their packet headers. The header contains the source and destination address of the packets, and usually also source and destination port number which are used by end hosts to direct packets to the correct application executing within them. These and other fields in the packet header each constitute a key that identifies the IP flow that the packet belongs to. In our context, we can think of the set of keys of packets arriving at the router in some time interval, each paired with the byte size of the corresponding packet, as a population of unaggregated data points.
Routers commonly compile summary statistics on the traffic passing through them, and export them to a collector which stores the summaries and supports query functions. Export of the unaggregated data is infeasible due to the expense of the bandwidth, storage and computation resources that would be required to support queries. On the other hand, direct aggregation of byte sizes over all distinct flow keys at a measuring router is generally infeasible at present due to the amount of (fast) memory that would be required to maintain and update, at line rate, the summaries for the large number of distinct keys present in the data. Thus some other form of summarization is required.
Common queries for network administrators would include: (i) calculating the traffic matrix, i.e., the weight between source-destination address pairs; (ii) the application mix, as indicated by the weight associated with various port numbers; and (iii) popular websites, as indicated by destination addresses using certain ports. Although some queries are routine, in exploratory and troubleshooting tasks the keys of interest are not known in advance.
Other network devices that serve content or mediate network protocols generate logs comprising records of each transaction. Examples include web servers and caches; content distribution servers and caches; electronic libraries for software, video, music, books, papers; DNS and other protocol servers. Each record may be considered as a data point, keyed, e.g., by requester or item requested, with weight being unity or, if appropriate, the size or price of the item requested. Offline libraries can produce similar records. Queries include finding the most popular items or the heaviest users, requiring aggregation over keys with a common user and/or item. Another example is sensor networks, which comprise a distributed set of devices, each of which generates monitoring events in certain categories.
All of these application examples, to a greater or lesser extent, share the feature that the approximate aggregation is subject to physical resource constraints on the information that can be carried through time or between locations. For example, there may be multiple distinct devices that produce data points, from which information flows to a single ultimate collector over links of limited bandwidth. If data points arrive as a data stream, then storage is limited. In the network traffic statistics application, measurements may be aggregated in mediation devices (e.g., one per geographic router center) which in turn export to a central collector. Sensor networks may deploy a large number of sensor nodes with limited capabilities that can collaborate locally to aggregate their measurements before relaying messages more widely. Physical layout aside, when summarizing data that resides on external memory or when exploiting parallel processing to speed up the computation, the computation is subject to similar data flow constraints imposed by the underlying model.
There has been a considerable amount of work in past years devoted to finding efficient data summarization schemes.
Summarizing Aggregated Data. In aggregated data sets, each data point has a unique key. There are many summarization methods for such data sets in the literature that produce summaries that support unbiased estimates for subpopulation weight. Reservoir sampling from a single stream is the basis of the stream database of Johnson et al. [T. Johnson, S. Muthukrishnan, and I. Rozenbaum, S
These summarizations, however, cannot be computed over the unaggregated data unless the data is first aggregated, which is prohibited by application constraints. Firstly, the best estimators for summaries derived from aggregated data utilize the exact weight of each key that is included in the summary. Secondly, the distribution itself of keys that are included in the summary cannot be computed under the IFT constraints. (The only exception is weighted sampling (with or without replacement), but even though we can efficiently determine the keys to include in the summary over the unaggregated data, we need a “second pass” (or another communication round) to obtain the total weight of each included key in order to compute the estimators.)
These methods can be applied to produce data-point-level summaries, by effectively treating each data point as having a unique key. These summaries, however, have large multiplicities of the same key and they are considerably less accurate than key-level summaries. This prompted the development of methods that compute key-level summaries over the unaggregated data.
Summarizing Unaggregated Data. Summarization of unaggregated data sets was extensively studied [N. Alon, Y. Matias, and M. Szegedy, T
Concise samples [P. Gibbons and Y. Matias, N
Counting samples [P. Gibbons and Y. Matias, N
Subpopulation-weight estimators for
Step-counting SH (
Propagation of Summaries on Trees. Multistage aggregation for threshold sampling [N. G. Duffield, C. Lund, and M. Thorup, L
From the foregoing discussion, it will be apparent that a summarization method for unaggregated data sets desirably will work on massive data streams in the face of processing and storage constraints that prohibit full processing; will produce a summarization with low variance for accurate analysis of data; will be one that is efficient in its application (will not require inordinate amounts of time to produce); will provide unbiased summaries for arbitrary analysis of the data; and will limit the worst case variance for every single (arbitrary) subset.
The prior art summarization methods described above have been unable to satisfy all of these desiderata.
Accordingly, there is a need to provide a summarization method for unaggregated data that produces results better than those attainable by prior art methods.
We formalize the above physical and logical constraints on the information flow using Information Flow Trees (IFTs). Data points are generated at leaves of the tree and information flows bottom-up from children to parent nodes. Each node in the tree obtains information (only) from its children and is subjected to a constraint on the information it can propagate to its parent node and to its internal processing constraints (which can also be captured by an IFT). For our summarization problem, at each node, the IFT constraints prohibit computation on the full aggregated data presented by its children nodes. Rather, the node combines the inputs from its children into one summary, which is hence a summary of all the data produced by leaf nodes descended from it. The physical and logical constraints translate to an IFT or family of applicable IFTs. Subject to these constraints, we are interested in obtaining a summary that allows us to answer approximate queries most accurately.
Our summaries are based on adjusted weights which means that both data sets and summaries have a consistent representation as a weighted set: a set of keys with weights associated with each key. We develop a Summarization Algebra for manipulating adjusted-weight summaries. In our framework, summarization and merging of summaries of unaggregated data sets are composable operators that allow us to perform summarization subject to arbitrary IFT constraints and at the same time preserve the good properties of the summarization.
One of the discoveries we have made is that IFT constraints, and the data stream model constraints in particular, prohibit variance-optimal summarization of unaggregated data. This contrasts with what is possible for aggregated data, for which there exists an optimal summarization scheme (VAROPT) that is applicable for data streams and general IFT constraints [M. T. Chao, A
In particular, in accordance with the summarization method of the present invention, unaggregated data are summarized by utilizing at summarization points an adjusted weight summarization method that inputs a weighted set of size k+1 and outputs a weighted set of size k (removes a single key). As we discuss below, by including the local application of the VAROPT algorithm, we obtain the desirable properties we seek.
In further aspects of the invention, the summarization is performed using merging and sampling operations applied to a data set of weighted keys. The algorithm maintains adjusted weights of keys that are unbiased estimates of their actual weights. The sampling operations are applied using these same adjusted weights.
In a particular aspect of the invention, a method for producing a summary A of data points in an unaggregated data stream, wherein the data points are in the form of weighted keys (a, w) where a is a key and w is a weight, and the summary is a sample of k keys a with adjusted weights ŵa, comprises providing a first set or reservoir L with keys having adjusted weights which are the sums of the weights of the individual data points of the included keys; providing a second set or reservoir T with keys having adjusted weights which are each equal to a threshold value τ whose value is adjusted based upon tests of new data points arriving in the data stream; and combining the keys and adjusted weights of the first reservoir L with the keys and adjusted weights of the second reservoir T to form the summary representing the data stream. A third reservoir X may advantageously be used for temporarily holding keys moved from reservoir L and for temporarily holding keys to be moved to reservoir T in response to tests applied to new data points arriving in the stream. The method proceeds by first merging new data points in the stream into the reservoir L until the reservoir contains k different keys, and thereafter applying a series of tests to each newly arriving data point to determine how its key and weight compare to the keys and adjusted weights already included in the summary.
For example, a first test may determine if the key of the new data point is already included in reservoir L, and if so, to increase the adjusted weight of the included key by the weight of the new data point; a second test may determine if the key of the new data point is already included in reservoir T, and if so, to move the key from reservoir T to reservoir L and to increase the adjusted weight of the included key by the weight of the new data point. If the key of the new data point is not already included in reservoir T or reservoir L, a third test may determine if the weight of the new data point is greater than the threshold value τ, and if so, to add the key and weight of the new data point to reservoir L, and if not, to add the key of the new data point to temporary reservoir X. Another test may be utilized to determine if the key with the minimum adjusted weight included in reservoir L is to be moved to reservoir X, and a further test based on a randomly generated number may be used to determine keys to be removed from reservoirs T or X. In this fashion, each new data point (the (k+1)-st data point) is used to produce a sample of k keys that faithfully represents the data stream for use in subsequent analysis.
The foregoing method may be used to summarize separate data streams, and their summaries may in turn be summarized using the same method.
Our method supports multiple weight functions. These occur naturally in some contexts (e.g. number and total bytes of a set of packets). They may also be used for derived quantities, such as estimates of summary variance, which can be propagated up the IFT.
We compared our method to state-of-the-art methods that are applicable to IFTs and, more specifically, to data streams. We found that our method produces more accurate summaries for a given summary size, typically with a reduction in variance.
Our method performed very close to the (unattainable) variance optimality, making it a practically optimal summarization scheme for unaggregated data.
Lastly, our method is efficient, using only O(log k) amortized time per step; in practice it is much faster, taking constant time per step on non-pathological sequences.
The summarization method for unaggregated data of the present invention provides a summarization that is a composable operator, and as such, is applicable in a scalable way to distributed data and data streams. The summaries support unbiased estimates of the weight of subpopulations of keys specified using arbitrary selection predicates and have the strong theoretical property that the variance approaches the minimum possible if the data set is “more aggregated.”
The main benefit of the present method is that it provides much more effective summaries for a given allocated size than all previous summarization methods for an important class of applications. These applications include IP packet streams, where each IP flow occurs as multiple interleaving packets, distributed data streams produced by events registered by sensor networks, and Web page or multimedia requests to content distribution servers.
These and other objects, advantages and features of the invention are set forth in the attached description.
The foregoing summary of the invention, as well as the following detailed description of the preferred embodiments, is better understood when read in conjunction with the accompanying drawings, which are included by way of example and not by way of limitation with regard to the claimed invention:
Information-flow trees (IFTs) are graphic tools that may be used to represent both (a) the operations performed and (b) the constraints these operations are subjected to when summarizing an unaggregated data set. An IFT is a rooted tree with a data point at each leaf node (the input). Edges are directed from children to parent nodes and have associated capacities that capture storage or communication constraints that are imposed by the computational setup. Information is passed bottom-up from children nodes to parent nodes, subject to capacity constraints on edges and processing constraints at each internal node. The processing constraints at each internal node can also be modeled using an IFT (this makes it a recursive structure). The information contained at each node depends only on the data points at descendant leaves.
The constraints imposed by the data stream model are captured by the IFT1 shown in
An information flow tree IFT2 for summarization of multiple distributed data streams S1, S2, etc., over some communication network is illustrated in
The summarization methods that take place in IFT1, IFT2 and IFT3 according to the present invention are arranged to summarize unaggregated data subject to the constraints noted above using adjusted-weight summarization, and to use merging and addition steps that advantageously preserve desirable data qualities to provide a resulting data summary in a form that allows us to answer approximate queries with respect to the data most accurately.
Theoretical Background
To understand the data qualities that the present invention seeks to obtain, and how the summarization methods of the present invention achieve these qualities, some definition of terminology and some background explanation is necessary with respect to adjusted-weight summaries and their variances.
A weight assignment w: U→[0, ∞) is a function that maps all keys in some universe U to non-negative real numbers. There is a bijection between weight assignments and corresponding weighted sets, and we use these terms interchangeably.
The weighted set that corresponds to a weight assignment w is the pair (I,w), where I≡I(w)⊂U is the set of keys with strictly positive weights. (Thus, w is defined for all possible keys (the universe U) but requires explicit representation only for I.)
A data point (i, x) corresponds to a weight assignment w such that w(i)=x and w(j)=0 for j≠i.
In the following description, we include various definitions, theorems, and lemmas, but for simplicity have omitted proofs.
DEFINITION 1. Adjusted-weight summary (AW-summary) of a weight assignment w is a random weight assignment A such that for any key iεU, E[A(i)]=w(i).
AW-summaries support estimating the weight of arbitrary subpopulations: For any subpopulation J⊂U, the sum ΣiεJ A(i) is an unbiased estimate of w(J). Note that the estimate is obtained by applying the selection predicate only to keys that are included in the summary A and adding up the adjusted weights of keys that satisfy the predicate.
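Stated as a routine, the estimator is a one-liner. The following Python sketch is illustrative only; the name estimate_subpopulation_weight is hypothetical, and a summary is represented here simply as a dictionary holding the (positive) adjusted weights of the included keys.

```python
def estimate_subpopulation_weight(summary, predicate):
    """Unbiased estimate of w(J): apply the selection predicate only to the keys
    included in the adjusted-weight summary A and add up their adjusted weights."""
    return sum(adjusted for key, adjusted in summary.items() if predicate(key))
```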
Different AW-summaries of the same weighted set are compared based on their size and estimation quality. The size of a summary is the number of keys with positive adjusted weights. The average size of an AW-summary is E[|{i|A(i)>0}|]. An AW-summary has a fixed size k if it assigns positive adjusted weight to exactly k keys.
Variance is the standard metric for the quality of an estimator for a single quantity, such as the weight of a particular subpopulation. In particular, the variance of A(i) (the adjusted weights assigned to a key i under AW-summary A) is
VARA[i]≡VAR[A(i)]=E[(A(i)−w(i))²]=E[A(i)²]−w(i)²
and the covariance of A(i) and A(j) is
COVA[i, j]≡COV[A(i), A(j)]=E[A(i)A(j)]−w(i)w(j).
The variance for a particular subpopulation J is equal to VAR[ΣiεJ A(i)]=ΣiεJ VARA[i]+Σi,jεJ, i≠j COVA[i, j].
Since AW-summaries are used for arbitrary subpopulations that are not specified a priori, the notion of a good metric is more subtle. There is generally no single AW-summary that dominates all others of the same size on all subpopulations (it is very easy to construct AW-summaries that have zero variance on any one subpopulation but are very bad otherwise).
The average variance over subpopulations of a certain weight or size was considered by M. Szegedy and M. Thorup (On the variance of subset sum estimation). Two quantities of a summary A determine this average variance: ΣV[A]≡ΣiεU VARA[i], the sum of the per-key variances, and VΣ[A]≡VAR[ΣiεU A(i)], the variance of the estimate of the total weight.
An AW-summary preserves total weight if ΣiεI A(i)=w(I) (therefore VΣ[A]=0 and is minimized). The average variance, among AW-summaries that preserve total weight, is minimized when ΣV[A] is minimized. For two total-preserving AW-summaries A1 and A2 of the same weighted set, the ratio of the average variance over any subpopulation size is ΣV[A1]/ΣV[A2].
In practice, average variance is an insufficient measure, as we need to be able to bound the variance on arbitrary subpopulations (avoid pathological cases) and obtain confidence intervals. Therefore, this metric is complemented by limiting the covariance structure so that the variance over subpopulations is more “balanced.” An AW-summary A has non-positive covariances if for every two keys i≠j, COVA[i, j]≦0 (equivalently, E[A(i)A(j)]≦w(i)w(j)). We similarly say that A has zero covariances if for every two keys i≠j, COVA[i, j]=0. A case for the combined properties of total preserving and non-positive covariances was made in E. Cohen and H. Kaplan, T
Combining the above desirable properties, we say that an AW-summary is optimal if ΣV is minimized, VΣ=0 (it is total preserving), it has non-positive covariances, and it has a fixed size. This combination of desirable properties dates back to A. B. Sunter, List sequential sampling with equal or unequal probabilities without replacement (Applied Statistics, 26:261-268, 1977), but was first realized by E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup, V
All the AW-summarizations we propose and evaluate preserve total weight and have fixed size and non-positive covariances and thus same-size summaries can be conveniently compared using the one-dimensional metric ΣV.
An AW-summary has Horvitz-Thompson (HT) adjusted weights [see D. G. Horvitz and D. J. Thompson, A generalization of sampling without replacement from a finite universe (J. Amer. Statist. Assoc., 47:663-685, 1952)] if the adjusted weight of each key that is included in the summary equals the weight of the key divided by the probability that the key is included, that is, A(i)=w(i)/Pr[A(i)>0] whenever A(i)>0.
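The following Python sketch illustrates HT adjusted weights for a given sampling step; the function name ht_adjusted_weights and the argument layout are hypothetical, and the caller is assumed to supply the true inclusion probability of each sampled key.

```python
def ht_adjusted_weights(weights, inclusion_prob, sampled_keys):
    """Horvitz-Thompson adjusted weights: A(i) = w(i)/p(i) for each sampled key i,
    where p(i) = Pr[i is included]; keys not sampled have adjusted weight 0 and are
    omitted.  Since E[A(i)] = p(i) * (w(i)/p(i)) = w(i), the result is an AW-summary."""
    return {i: weights[i] / inclusion_prob[i] for i in sampled_keys}
```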
The sum w=w1⊕w2 of two weight assignments w1 and w2 is a weight assignment defined by key-wise addition, w(i)=w1(i)+w2(i) (i ε U).
For the sum (merge) of the corresponding weighted sets we use the notation
(I1,w1)⊕(I2,w2)=(I1∪I2, w1⊕w2).
The definition naturally applies to the sum A1⊕A2 of random weight assignments A1 and A2, (and in particular also to AW-summaries), and extends to the sum of multiple weight assignments w1⊕w2 ⊕ . . . ⊕ wh=⊕j=1h wj. Observe that the sum operation is commutative.
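The merge operation itself is a simple key-wise addition, as the following illustrative Python sketch (with the hypothetical function name merge) shows:

```python
from collections import defaultdict

def merge(*weighted_sets):
    """The sum (merge) operation: key-wise addition of weighted sets, each
    represented as a dict mapping key -> weight; commutative and associative."""
    total = defaultdict(float)
    for ws in weighted_sets:
        for key, w in ws.items():
            total[key] += w
    return dict(total)
```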
Some important properties, including being an AW-summary, are additive. Let wj (1≦j≦h) be weight assignments with respective AW-summaries Aj. Let w=⊕j=1h wj. Then two lemmas follow:
LEMMA 1. The random weight assignment A=⊕j=1h Aj is an AW-summary of w.
LEMMA 2. If the AW-summaries Aj are independent, the covariances are additive. That is, for every pair of keys i, i′ ε U, COVA[i, i′]=Σj=1h COVAj[i, i′], where A=⊕j=1h Aj.
Proof of LEMMAS 1 and 2 is omitted from this specification for simplicity.
The sum of AW-summaries preserves the non-positive covariances and zero covariances properties:
COROLLARY 3. If the AW-summaries Aj are independent, then if Aj (j=1, . . . , h) have the non positive covariances or the zero covariances properties then so does the AW-summary ⊕j=1h Aj. This follows because if all summands are zero or non-positive, so is their sum.
COROLLARY 4. If the AW-summaries Aj are independent, then for each key i ε U, VARA[i]=Σj=1h VARAj[i], where A=⊕j=1h Aj.
COROLLARY 5. If the AW-summaries Aj are independent, then ΣV[⊕j=1h Aj]=Σj=1h ΣV[Aj].
DEFINITION 6. An adjusted-weights summarization scheme (AW-summarization) S is a mapping from a set of weight assignments to corresponding AW-summaries, that is, for a weight assignment w, S(w) is an AW-summary of w.
If we can apply S to a random weight assignment A, we use the notation S∘A≡S(A).
We can now establish transitivity of AW-summary properties under the composition operation.
LEMMA 7. Let A be an AW-summary of a weight assignment w, and let S be an AW-summarization defined over the range of A. (We define the range of a probability distribution to be all points with positive probability (or positive probability density).) Then S∘A is an AW-summary of w.
The composition S1∘S2∘ . . . ∘Sh (the domains must be compatible for this to be defined) of several AW-summarizations S1, . . . , Sh is also an AW-summarization.
Suppose A, B are AW-summaries of w with the property that E[B(i)|A]=A(i) for all i ε U. Then COVB[i, j|A] will denote the conditional covariance of B(i), B(j), i.e., conditioned on A. Set VARB[i|A]=COVB[i, i|A]. The following is a Law of Total (co)Variance for the present model.
LEMMA 8. For each pair of keys i, j ε U,
COVB[i, j]=E[COVB[i, j|A]]+COVA[i, j]
COROLLARY 9. For each key i ε U,
VARB[i]=VARA[i]+E[VARB[i|A]]
In particular, we have, in an obvious notation, ΣV[S∘A]=ΣV[A]+E[ΣV[S∘A|A]].
LEMMA 10. If the AW-summary A and the AW-summaries in the range of S preserve total weight, have the zero covariances property, or have the non-positive covariances property, then so does the AW-summary S∘A. (The proof follows directly from LEMMA 8.)
As described above with reference to
The internal summarization at each node can be performed by first adding (merging) the weighted sets collected from its children and then applying an AW-summarization to the merged set that reduces it as needed to satisfy the capacity constraint on the edge directed to its parent. There may be internal IFT constraints at the node, however, that do not allow for efficiently merging the input sets: We may want to speed up the summarization process by partitioning the input among multiple processors, or the data may be stored in external memory, and at the extreme, if internal memory suffices to store only a summary of the output size, it may be preferable to process the concatenated inputs as an unaggregated stream.
The additivity and transitivity properties of AW-summaries guarantee that if each basic summarization step at and below a node utilizes total-preserving and non-positive covariances AW-summarization, then the output of the node is also a total preserving and non-positive covariances AW-summary.
Note that for this property to hold, the IFT structure does not have to be fixed. The IFT nodes represent operations on the data. The next operation (in the structure above the node) can depend on the output and the operation itself can depend on the input data points. For a certain data set, we can consider a family of such recursive IFTs (which allow, for example, for different arrival orders of data points or for variable size streams).
In the summarization of an unaggregated data stream, a fixed-size summary S of size k is propagated from child to parent. Each parent node adds the new single data point (i′, w′) to the summary to obtain S′=S⊕{({i′}, w′)}. If S′ contains k+1 distinct items (that is, the key i′ does not appear in S), we apply an AW-summarization that reduces the summary from size k+1 back to size k.
The basic building block of data stream summarization is an AW-summarization that inputs a weighted set of size k+1 and outputs a weighted set of size k (removes a single key).
Interestingly, any AW-summarization that produces a size-k AW-summary from a size-(k+1) weighted set using HT adjusted weights has the non-positive covariances property:
LEMMA 11. Consider an AW-summarization that, for an input weighted set of size k+1, produces summaries of fixed size k (for inputs that are already of size k, it returns the input set) and uses the HT adjusted weights. This AW-summarization has non-positive covariances.
Interestingly, there is a unique such AW-summarization that is also total-preserving and minimizes ΣV, which means it is locally optimal for this primitive. The scheme is L-VAROPTk (local application of VAROPT). We refer to an application of our summarization algebra on an unaggregated stream in conjunction with the L-VAROPTk primitive as SA-STREAM-VOPTk.
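A minimal sketch of such a drop-one primitive, consistent with the description above, is given below in Python. The function name l_varopt_drop_one and the dictionary-based representation are this sketch's own, and the sketch is unoptimized (it sorts its entire input rather than maintaining the tuned data structure of method 100 described later); it does, however, preserve total weight and uses HT adjusted weights.

```python
import random

def l_varopt_drop_one(weighted_set):
    """Drop-one step: given a weighted set of k+1 keys (dict key -> positive weight),
    return an adjusted-weight summary of k keys that preserves total weight and
    uses HT adjusted weights."""
    items = sorted(weighted_set.items(), key=lambda kv: kv[1])   # ascending by weight
    small, large = list(items), []
    # Find the threshold tau with sum_i min(1, w_i/tau) = k: keys heavier than tau
    # are kept with probability 1, so move them out of the "small" candidate set.
    while True:
        tau = sum(w for _, w in small) / (len(small) - 1)
        if small[-1][1] > tau:
            large.append(small.pop())
        else:
            break
    # Exactly one small key is dropped; key i is dropped with probability 1 - w_i/tau
    # (these probabilities sum to 1).  Surviving small keys get adjusted weight tau.
    r, acc, dropped = random.random(), 0.0, None
    for key, w in small:
        acc += 1.0 - w / tau
        if dropped is None and r < acc:
            dropped = key
    if dropped is None:                  # guard against floating-point round-off
        dropped = small[-1][0]
    summary = dict(large)
    summary.update({key: tau for key, _ in small if key != dropped})
    return summary
```

For each kept key the adjusted weight equals its weight divided by its inclusion probability (weight w for keys above the threshold, and tau for keys below it), and the kept adjusted weights sum to the input total, in keeping with the properties discussed above.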
When the IFT constraints allow, instead of adding one data point at a time we can consider a sequence of batch additions (merges) of sets of data points followed by summarizations. The motivation for batch additions before summarizing is that we extend the local optimality (minimal ΣV) from being per data point to being per batch. Formally, for a weighted set (J, A) (representing the current summary) and data points (i1, w1), . . . , (ir, wr),
ΣV[VAROPTk((J, A)⊕⊕j=1r{(ij, wj)})]≦ΣV[L-VAROPTk( . . . L-VAROPTk(L-VAROPTk((J, A)⊕{(i1, w1)})⊕{(i2, w2)}) . . . ⊕{(ir, wr)})].
The left hand side, by optimality of VAROPTk, is the minimum ΣV for size-k AW-summaries of the weighted set (J, A)⊕⊕j=1r{(ij, wj)}. The right hand side is the ΣV of another AW-summary of this weighted set. Concretely, consider a node that obtains multiple size-k summaries from its children, can internally store a size-k′ summary in memory (k′≧k), and outputs a size-k summary. If the number of distinct keys is at most k′, we should merge the input summaries before summarizing them to size k. If k′=k, we apply SA-STREAM-VOPTk on the concatenation of the inputs. Otherwise, we add data points until we have k′ distinct keys (this is effectively a partial merge), apply SA-STREAM-VOPTk′ to the remaining data points, and apply VAROPTk to the result.
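The following Python sketch illustrates one possible node procedure along these lines; the function name summarize_node is hypothetical and it reuses the illustrative l_varopt_drop_one helper sketched earlier. For simplicity, the final reduction here repeats the drop-one primitive rather than applying a single batch VAROPTk step, which is a simplification relative to the procedure described above.

```python
def summarize_node(child_summaries, k, k_prime):
    """Merge child summaries while at most k_prime distinct keys fit in memory,
    reducing with the drop-one primitive whenever the limit is exceeded, and
    finally reduce to the output size k."""
    summary = {}
    for child in child_summaries:
        for key, w in child.items():
            summary[key] = summary.get(key, 0.0) + w     # add one data point (partial merge)
            if len(summary) > k_prime:                   # internal memory limit exceeded
                summary = l_varopt_drop_one(summary)     # reduce back to k_prime keys
    while len(summary) > k:                              # final reduction to the output size
        summary = l_varopt_drop_one(summary)
    return summary
```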
We refer to the generic application of our summarization algebra (arbitrary addition and summarization steps) with L-VAROPT as the summarization primitive as SA+VOPT. If the data happen to be aggregated and all intermediate summarizations allow summary size that is at least the output size, then SA+VOPT is an instance of VAROPT. Therefore, by leveraging VAROPT as a building block, ΣV gracefully converges to the optimal when the data is more aggregated and attains it if the data set happens to be aggregated.
As typical for “online” problems, we can show that there is no IFT-constrained summarization algorithm of unaggregated data sets that minimizes ΣV. This is in contrast to aggregated data sets (where VAROPT minimizes ΣV).
THEOREM 12. There is no AW-summarization algorithm for unaggregated streams that produces a fixed-size summary that minimizes ΣV. (Proof omitted.)
Given Theorem 12, it is not very surprising that we could construct an example where SA-STREAM-VOPT has a slightly larger ΣV than the minimum possible ΣV for the same input and summary size.
We conclude the theoretical discussion with a conjecture. We define the competitive ratio of an AW-summarization as the worst-case ratio (over all applicable unaggregated input data sets) between its ΣV and the minimum possible ΣV on the corresponding aggregated data for a summary of the same size. The competitive ratio of
When considering SA+VOPT on a data set and a corresponding family of IFTs, we define k′ to be the smallest size of an intermediate summary on which L-VAROPT is applied (that is, the smallest i such that L-VAROPTi is used). We conjecture that the ratio of ΣV[SA+VOPT] to ΣV[VAROPTk′] is bounded by a constant. In practice, SA+VOPT is very close to optimal and outperforms all other algorithms.
Accordingly, in order to provide improved processing, the present invention is implemented in method 100, which maintains the summary in a tuned data structure that reduces worst-case per-data-point processing to amortized O(log k) time. The implementation is fast and further benefits from the fact that the theoretical amortized O(log k) bound applies to worst-case distributions and arrangements of the data points. Performance on “real” sequences is closer to O(1) time per data point, and the same holds for randomly permuted data points.
In method 100, the input is an unaggregated stream of data points (a, w) where a is a key and w is a positive weight. The output is a summary A which is a sample of up to k keys. Each included key a has an adjusted weight ŵa. If a key is not in A its adjusted weight is 0. Method 100 proceeds using the summarization algebra described above, and thus the summary A has the advantageous properties that accompany its use.
In method 100, a threshold τ, initially set to 0, is calculated. The keys in A are partitioned into two sets or reservoirs L and T, each initially empty and populated with keys a in the data stream with adjusted weights ŵa as will be described below. In accordance with method 100, each a ε L has a weight wa≧τ. The set L is stored in a priority queue which always identifies the key with the smallest weight minaεL wa. As will be described below, when a new data point arrives, a determination is made whether to move the key with the smallest weight from set L. Each a ε T has a weight wa≦τ. The set T is stored as a prefix of an array of size k+1. For every a ε A, the adjusted weight is ŵa=max{τ,wa}. Thus ŵa=wa for a ε L while ŵa=τ for a ε T.
Referring to
In step 104, the set or reservoir L is populated with arriving data points until it contains k different keys.
Returning to
If step 106 determines that the data stream has not ended, in step 108 new data point arrivals (the k+1 data points) are tested to determine whether their keys and weights are to be included in sets L or T, and whether other keys and weights are to be moved or removed in order to provide a sample with just k keys. The tests of step 108 are shown in greater detail in
Referring now to
If step 300 determines that the new data point is not in A, then the method proceeds to step 308 to apply a test to determine if keys with low adjusted weights in L are to be moved to T, and then to step 310 to apply a test to determine which key to remove to maintain the number of keys in A at k. The method then returns to the threshold updating step 110 in
Step 308 is shown in greater detail in
Referring to
Steps 404 and 408 then proceed to step 410, which determines if smallsum≧(|T|+|X|−1)minbεL wb, and if so proceeds to find the key b in L with minimum weight, i.e., b←arg minbεL wb, and moves b from L to the end of X. As mentioned previously, set L is preferably stored as a priority queue, identifying the minimum weight key b, so that the steps for locating and moving it are simplified. Set X is also preferably stored as a priority queue to facilitate random selection of a key to remove. Step 410 then updates smallsum←smallsum+wb, and returns to the beginning of step 410 to determine if a new minimum adjusted weight member of L should be moved to X.
At the conclusion of step 410, one or more low adjusted weight keys in L will have been moved to the temporary reservoir X, and the method then proceeds to step 310 to determine which key to remove from T or X to maintain the number of keys in summary A at k.
Step 310 is shown in
If step 504 determines that it is not true that r<|T|(1−τ/t), then in step 510 r is updated as r←r−|T|(1−τ/t), and d is set as d←0. Then in step 512, while r>0, r is updated as r←r−(1−wX[d]/t) and d is updated as d←d+1; in step 514 X[d] is removed from X, and the number of keys in summary A remains at k.
The removal of a key from T in step 508 or the removal of a key from X in step 514 results from a random selection process over the keys in T or X, which, by reason of their selection for placement in these sets, have adjusted weights below the threshold τ; thus their removal does not influence the more significant weights of keys included in L. The selection process is consistent with the HT conditions and preserves the quality of the sample A.
After a key is removed from T in step 508 or from X in step 514, the method proceeds to step 516, where T is updated as T←T∪X. At this point step 310 is completed, and the method proceeds to step 110 of
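By way of illustration only, and not as a limitation of the claims, the following Python sketch shows one way in which the processing described above with respect to method 100 could be organized. The class name StreamSummary and its method names are hypothetical; a lazy-deletion heap stands in for the priority queue of reservoir L, plain lists stand in for reservoirs T and X, and the indexing details of the random removal step are approximated rather than reproduced exactly.

```python
import heapq
import random

class StreamSummary:
    """Sketch of the stream summarization of method 100: at most k keys are kept;
    keys in L carry their exact accumulated weight, keys in T carry the common
    adjusted weight tau."""

    def __init__(self, k):
        self.k = k
        self.tau = 0.0      # threshold: adjusted weight of every key in T
        self.L = {}         # key -> exact accumulated weight
        self.heap = []      # min-heap of (weight, key); stale entries are skipped
        self.T = []         # keys whose adjusted weight is tau

    def _set_L(self, key, w):
        self.L[key] = w
        heapq.heappush(self.heap, (w, key))

    def _min_L(self):
        while self.heap and self.L.get(self.heap[0][1]) != self.heap[0][0]:
            heapq.heappop(self.heap)               # discard stale heap entries
        return self.heap[0][0] if self.heap else float("inf")

    def _pop_min_L(self):
        w = self._min_L()
        _, key = heapq.heappop(self.heap)
        del self.L[key]
        return key, w

    def process(self, key, w):
        """Process one data point (key, w) of the unaggregated stream."""
        if key in self.L:                          # key already held with exact weight
            self._set_L(key, self.L[key] + w)
            return
        if key in self.T:                          # key held with adjusted weight tau
            self.T.remove(key)
            self._set_L(key, self.tau + w)         # move to L; adjusted weight grows by w
            return
        if len(self.L) + len(self.T) < self.k:     # reservoir not yet full
            self._set_L(key, w)
            return
        # A (k+1)-st distinct key: exactly one key of T or X must be removed.
        X = []                                     # keys carrying exact weights below the new threshold
        if w > self.tau:
            self._set_L(key, w)
        else:
            X.append((key, w))
        smallsum = self.tau * len(self.T) + sum(xw for _, xw in X)
        # Move minimum-weight keys of L into X while they fall under the new threshold.
        while self.L and smallsum >= (len(self.T) + len(X) - 1) * self._min_L():
            b, wb = self._pop_min_L()
            X.append((b, wb))
            smallsum += wb
        t = smallsum / (len(self.T) + len(X) - 1)  # new threshold
        # Remove one key at random; removal probabilities (1 - tau/t for each key of T,
        # 1 - w/t for each key of X) sum to one.
        r = random.random()
        if r < len(self.T) * (1.0 - self.tau / t):
            self.T.pop(random.randrange(len(self.T)))
        else:
            r -= len(self.T) * (1.0 - self.tau / t)
            d = 0
            while d < len(X) and r > 0.0:
                r -= 1.0 - X[d][1] / t
                d += 1
            X.pop(max(d - 1, 0))
        self.T.extend(x_key for x_key, _ in X)     # surviving X keys join T
        self.tau = t

    def summary(self):
        """Adjusted-weight summary A: exact weights for keys in L, tau for keys in T."""
        out = dict(self.L)
        out.update({key: self.tau for key in self.T})
        return out
```

In use, each data point of the stream is passed to process(), and summary() returns the adjusted-weight sample at any point; two such summaries of separate streams can in turn be merged key-wise and re-summarized, consistent with the composability discussed above.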
The method described above for summarizing unaggregated data in a stream has been evaluated in comparison to other previously known methods. The results have shown that the method of the invention provides improved results, with lower variance and hence tighter estimates than prior methods, and indeed performs very close to the (unattainable) optimum available with aggregated data.
Thus, the invention describes a feature enabling unaggregated data to be summarized in situations where processing resources (storage, memory, time) are constrained. While the present invention has been described with reference to preferred and exemplary embodiments, it will be understood by those of ordinary skill in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation to the teachings of the invention without departing from the scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiments disclosed, but that the invention include all embodiments falling within the scope of the appended claims.