1. Field of the Invention
The present invention generally relates to estimating cascaded aggregates over a matrix presented as a sequence of updates in a data stream. The problem of efficiently computing cascaded aggregates presents itself in several applications involving time-series data. For example, the analysis of credit card fraud may consist of first identifying high-valued transactions for each customer, and then computing the average over all customers. Other examples include stock transactions, where aggregates are determined over all customers for each company, and then aggregates are determined over all of the companies. In network traffic analysis, aggregates are determined over all destination addresses for each source address, and then aggregates are determined over all of the source addresses.
2. Description of the Related Art
Formally, the data stream consists of arbitrary additive updates to the elements (i, j) of a matrix (see the accompanying drawings).
A cascaded aggregate P∘Q is defined by evaluating aggregate Q repeatedly over each row of the matrix, and then evaluating aggregate P over the resulting vector of values. This problem was introduced by Cormode and Muthukrishnan.
Previously, Cormode et al. (“Time-Decaying Aggregates in Out-of-order Streams,” DIMACS Technical Report 2007-10; “Estimating the Confidence of Conditional Functional Dependencies,” SIGMOD '09, Jun. 29-Jul. 2, 2009) and Muthukrishnan presented methodologies for the case Q=Count-Distinct for different choices of P, in the context of mining multigraph data streams.
The problem with these methodologies is that they are too specific. First, they only solve a special case of the problem, namely Q=Count-Distinct, and second, they do not work in a general data stream where one is allowed to both insert and delete items.
An exemplary aspect of an embodiment of the invention includes a method of approximating aggregated values from a data stream in a single pass over the data stream, where values within the data stream are arranged in an arbitrary order. The method includes continuously receiving data sets from the data stream using a computerized device, the data sets being arranged in the arbitrary order. The data sets are segmented according to previously established categories to create aggregates of the data sets using the computerized device. Variances are computed with respect to a mean of logarithmic values of the data sets using the computerized device, and averages of the variances are calculated to produce approximated aggregated values for the data stream using the computerized device. Finally, the approximated aggregated values are output from the computerized device.
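By way of illustration, consider the following Python sketch of these steps; it is a minimal, single-machine stand-in (all names are illustrative) that keeps one exact running accumulator per category via Welford's method, and therefore ignores the low-storage sketching constraint that the embodiments described below are designed to satisfy.

```python
import math
from collections import defaultdict

class RunningVariance:
    """Welford's one-pass accumulator for the variance about the mean."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0

def average_variance_of_logs(stream):
    """stream yields (category, value) pairs in arbitrary order."""
    acc = defaultdict(RunningVariance)
    for category, value in stream:
        acc[category].add(math.log(value))  # variance w.r.t. the mean of log values
    return sum(a.variance() for a in acc.values()) / len(acc)

# Example: per-stock variance of log prices, averaged over stocks.
updates = [("IBM", 101.0), ("XYZ", 20.0), ("IBM", 103.5), ("XYZ", 19.2), ("IBM", 99.8)]
print(average_variance_of_logs(updates))
```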
With its unique and novel features, one or more embodiments of the invention provide a low-storage solution with an arbitrary ordering of data by maintaining random summaries, i.e., sketches, of the dataset, where the summaries arise from specific sampling techniques of the dataset.
The embodiments of the invention deal with complexity of estimating cascaded aggregates over a matrix presented as a sequence of updates and deletions in a data stream. A cascaded aggregate P∘Q is defined by evaluating aggregate Q repeatedly over each row of the matrix, and then evaluating aggregate P over the resulting vector of values. These have applications in the analysis of scientific data, stock market transactions, credit card fraud, and IP traffic.
The embodiments of the invention analyze the space complexity of estimating cascaded aggregates to within a small relative error for combinations of frequency moments (Fk) and norms (Lp).
1. For any 1≦k<∞ and 2≦p<∞, the embodiments of the invention obtain a 2-pass Õ(n^{2−2/p−2/(kp)})-space methodology for estimating Fk∘Fp. This is the main result of the embodiments of the invention, and it is optimal up to polylogarithmic factors. In particular, the embodiments of the invention resolve an open question regarding the space complexity of estimating F2∘F2. The embodiments of the invention also obtain 1-pass space-optimal methodologies for estimating F∞∘Fk and Fk∘F∞.
2. For any k≧0, the embodiments of the invention obtain a 1-pass space-optimal methodology for estimating Fk∘L2. These techniques also solve the “heavy hitters” problem for rows of the matrix weighted by L2 norm.
The foregoing and other exemplary purposes, aspects and advantages will be better understood from the following detailed description of an exemplary embodiment of the invention with reference to the drawings, in which:
Referring now to the drawings, and more particularly to
The recent explosion in the processing of terabyte-sized data sets has led to significant scientific advances as well as competitive advantages for economic entities. With the widespread adoption of information technology in healthcare, and in the tracking of individual clicks over the internet, massive data sets have become increasingly important on a societal and personal level. The constraints imposed by processing this massive data have inspired highly successful new paradigms, such as the data stream model, in which a processor makes a quick “sketch” of its input data in a single pass and is able to extract important statistical properties of the data. This has yielded efficient methodologies for several classical problems in the area, including frequency-based statistics, ranking-based statistics, metric norms, and similarity measures (clustering the entries of the dataset into geometrically increasing intervals, and sampling a few items within each interval), along with a complementary rich set of lower-bound techniques and results.
Classically, frequency moments and norms have played a major role in the foundations of processing massive data sets. Given a stream X in the turnstile model, let fa(X) denote the total weight of an item a induced by the increments and decrements, possibly weighted, to a. Define the k-th frequency moment Fk(X) ≜ Σa|fa(X)|^k and the k-th norm Lk(X) ≜ (Fk(X))^{1/k}.
Special cases include distinct elements (F0), Euclidean norms (L2 and F2), and the mode (F∞), all of which have been studied thoroughly. Estimating Fk for k>2 has applications in statistics for estimating the skewness and kurtosis of a random variable, which provide measures of the asymmetry of a distribution. Let μk=E[(X−E[X])^k] be the k-th moment of X about the mean; the second moment of X about the mean, μ2=σ², is the variance. Skewness is formally defined as the standardized third moment about the mean, μ3/σ³, and (excess) kurtosis is formally defined in terms of the standardized fourth moment about the mean, μ4/σ⁴−3. Skewness and kurtosis are used frequently to model and understand risk. Finally, frequency moments have also influenced the development of several related measures such as entropy and heavy hitters.
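For concreteness, a short computation of skewness and excess kurtosis from raw samples might look as follows (an illustrative sketch using the population formulas above):

```python
def skewness_and_kurtosis(xs):
    n = len(xs)
    mean = sum(xs) / n
    mu2 = sum((x - mean) ** 2 for x in xs) / n   # variance sigma^2
    mu3 = sum((x - mean) ** 3 for x in xs) / n
    mu4 = sum((x - mean) ** 4 for x in xs) / n
    sigma = mu2 ** 0.5
    skew = mu3 / sigma ** 3
    excess_kurtosis = mu4 / sigma ** 4 - 3       # 0 for a normal distribution
    return skew, excess_kurtosis

print(skewness_and_kurtosis([1.0, 2.0, 2.0, 3.0, 10.0]))
```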
Frequency moments and norms are useful measures for single-shot aggregation. Most applications, however, deal with multi-dimensional data. In this scenario, the real insights are obtained by slicing the data multiple times, which involves applying several aggregate measures in a cascaded fashion. The following examples illustrate the power of such analysis:
Economics: In a stock market, the changes in various stock prices are recorded continuously using a quantity rlog known as the “logarithmic return on investment”. To compute the average historical volatility of the stock market from the data, the data needs to be segmented according to the stock name, the variance of the rlog values recorded for each stock computed (i.e., normalized L2 around the mean), and then the average of these values computed over all stocks (i.e., normalized F1). Similarly, estimating the kurtosis risk in credit card fraud involves aggregating high-volume purchases made on individual credit card numbers. This is akin to computing F1 on the transactions of individual credit cards followed by F4 on the resulting values.
IP traffic: Cormode and Muthukrishnan considered various measures for IP traffic which could be used to identify whether large portions of the network may be under attack. A skewness measure that captures this property involves grouping the packets by source address, computing F0 on the packets within each group based on the destination address (to count how many destination addresses are being probed) and then computing F3 on the resulting vector of values for the source nodes.
Computational geometry: Consider indexed point sets P={p1, …, pn} and Q={q1, …, qn} where each point belongs to R^d of high dimension. A useful distance measure between P and Q is the sum of squares of Lp distances between corresponding pairs of points, i.e., Σi‖pi−qi‖p².
If P contains k-distinct points (i.e., the matrix has k distinct rows), this could be the cost of the k-means problem with Lp-distances. If P is the projection of Q onto a k-dimensional subspace, this could be the cost of the best rank-k approximation with respect to squared Lp distances, a generalization of the approximate flat fitting problem to Lp distances.
Matrix approximation: Two measures that play a prominent role in matrix approximation are the operator norm and the maximum absolute row-sum norm. For a matrix A whose rows are denoted by A1, A2, …, An, they are defined as maxi‖Ai‖2 and maxi‖Ai‖1, respectively.
Product Metrics: The Ulam distance between two non-repetitive sequences is the minimum number of character insertions, deletions, and substitutions needed to transform one sequence into the other. It is shown that for every “gap” factor, there is an embedding of the Ulam metric on sequences of length d into a product metric that preserves the gap. This embedding transforms the sequence into a d^{O(1)}×d^{O(1)} matrix; the distance between two matrices is obtained by computing the l∞ distance on corresponding rows followed by an l2² computation. Interestingly, another embedding involves three levels of product metrics. The authors attempt to sketch F2∘L∞∘L1, though they are not able to sketch this metric directly. Instead, they use additional properties of their embedding into this product metric to obtain a short sketch which is sufficient for their estimation of the Ulam metric.
The following problem captures the above scenarios involving two levels of aggregation:
Definition 1 (Cascaded Aggregates). Consider a stream X of length n consisting of updates to items in [m]×[m], where m=n^{O(1)}. Let M denote the matrix whose (i, j)-th entry is fij(X). Given two aggregate operators P and Q, the cascaded aggregate P∘Q is obtained by first applying Q to each row of M, and then applying P to the resulting vector of values. Abusing notation, P∘Q is also applied to X directly, denoting (P∘Q)(X)=P(Q(X1), Q(X2), …, Q(Xm)), where Xi for each i denotes the sub-stream of X corresponding to updates to items (i, j) for all j∈[m].
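The definition can be made concrete with an offline reference implementation (illustrative only: it stores the entire matrix, which is precisely what the streaming methodologies below avoid):

```python
from collections import defaultdict

def cascaded_aggregate(updates, Q, P):
    """updates: iterable of ((i, j), delta) in arbitrary order.
    Q maps a row's frequency vector to a value; P aggregates the row values."""
    f = defaultdict(lambda: defaultdict(float))
    for (i, j), delta in updates:            # turnstile updates: deltas may be negative
        f[i][j] += delta
    return P([Q(list(row.values())) for row in f.values()])

# F2 o F0: count distinct (nonzero) entries per row, then sum of squares over rows.
F0 = lambda row: sum(1 for v in row if v != 0)
F2 = lambda vals: sum(v * v for v in vals)
updates = [((0, 1), 1), ((0, 2), 1), ((1, 1), 3), ((0, 2), -1)]
print(cascaded_aggregate(updates, F0, F2))   # rows 0 and 1 each have 1 distinct item -> 2
```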
Cormode and Muthukrishnan focused mostly on the case P∘F0 for different choices of P. For F2∘F0, they gave a methodology using Õ(√n) space (the tilde notation hides poly(log n, 1/∈) factors throughout this disclosure); for the heavy-hitters problem, they gave a methodology using space Õ(1) that returns a list of indices L such that (1) L includes all indices i such that F0(Xi)≧φm and (2) every index i∈L satisfies F0(Xi)≧(φ−∈)m.
The embodiments of the invention design computer-implemented methodologies for estimating several classes of cascaded frequency moments and norms. First, the embodiments of the invention give a near-complete characterization of the problem of computing cascaded frequency moments Fk∘Fp. The main result of the embodiments of the invention, and also technically the most involved, is the following:
for any k≧1 and p≧2, the embodiments of the invention obtain a 2-pass Õ(n^{2−2/p−2/(kp)})-space methodology for computing a (1±∈)-approximation to Fk∘Fp.
The embodiments of the invention prove that the complexity of the above-referenced computer-implemented methodology is optimal up to polylogarithmic factors. In particular, the embodiments of the invention show that the space complexity of estimating F2∘F2 is Θ(√n).
At the basic level, the computer-implemented methodology for Fk∘Fp cannot compute Fp(Xi) individually for every i, since that would take up too much space; this rules out using previous methodologies for frequency moments as a blackbox. On the other hand, the embodiments of the invention can safely ignore those rows whose Fp(Xi) values are relatively small. The crux of the problem is to focus in on those rows that have a significant contribution in terms of their Fp values, without calculating these values explicitly. This inherently forces a deeper look into the structure of methodologies for frequency moments. A promising direction is a methodology which also yields an approximate frequency histogram. This can be used as a basis to non-uniformly sample rows from the input matrix according to their Fp values, and to output an appropriate estimator. Although the estimator is straightforward, the analysis of this procedure is somewhat subtle due to the approximate nature of the histogram. However, a new wrinkle arises because the variance of the estimator is too large, and the samples obtained from the approximate histogram are not sufficient. Further, repeating the procedure would result in a huge blow-up in space.
The embodiments of the invention design a new computer-implemented methodology for obtaining a large number of samples according to an approximate histogram for Fp. The computer-implemented methodology builds on an existing framework but adds new ingredients to limit the space used to generate the samples. In particular, the embodiments of the invention resort to another sub-sampling procedure to handle levels that have many more items than the expected number of samples needed from the level. The analysis then shows that the samples from the approximate histogram estimator suffice to approximate Fk∘Fp. The computer-implemented methodology uses two (2) passes due to the separation of the sampling step from the step that evaluates the estimator.
Next, the embodiments of the invention study the problem of computing cascaded norms Fk∘L2. For any k>0, the embodiments of the invention obtain a 1-pass space-optimal methodology for computing a (1±∈)-approximation to Fk∘L2. These techniques also allow finding all rows whose L2 norm is at least a constant φ>0 fraction of F1∘L2 in Õ(1) space, i.e., solving the “heavy hitters” problem for rows of the matrix weighted by L2 norm.
Finally, for k≧1, the embodiments of the invention obtain 1-pass space-optimal methodologies for F∞∘Fk and Fk∘F∞.
The computer-implemented methodologies also have applications to entropy measures. This follows by using the Fk estimation methodologies in a blackbox fashion, setting k>1 close enough to 1 to estimate the entropy of a data stream.
As previously noted, Ganguly, Bansal, and Dube claimed an Õ(1)-space methodology for estimating Fk∘Fp for any k, p in [0, 2]. A simple reduction from multiparty set disjointness shows this claim is incorrect for any k, p for which k·p>2; indeed, for such k and p, poly(n) space is required.
Reducing Randomness: For simplicity, the embodiments of the invention describe the computer-implemented methodologies using random oracles, i.e., they have access to unlimited randomness, including the use of continuous distributions. These assumptions can be eliminated by the use of pseudo-random generators (PRGs), similar to the way Indyk used Nisan's generator. The extra ingredient, whose application to streaming methodologies seems to have escaped notice before, is the PRG due to Nisan and Zuckerman, which can be applied when the space used by the data stream methodology is n^{Ω(1)}. The advantage is that it does not incur the extra log factor in space incurred by Nisan's generator. Note that the same approach also results in a similar improvement in space in previous methodologies for frequency moments. This is summarized in the proposition below. It can be checked that the computer-implemented methodologies indeed satisfy the assumptions; the arguments are tedious but similar to those found in Indyk.
Proposition 2. Let P be a multi-pass, space s(n), data stream methodology on a stream X using (distributional) randomness R satisfying the following:
1. There exists a reordering of X (e.g., sort by item id) called X′ such that (i) all updates to each item a in X appear contiguously in X′, and (ii) P(X,R)=P(X′,R) with probability 1;
2. R can be broken into jointly independent chunks Ra,k over items a and passes k such that the only randomness used by P while processing updates to a in the k-th pass is Ra,k;
3. for each a and k, there exists a polylog(n)-bit random string R̃a,k=t(Ra,k) (e.g., obtained via truncation) with the property that |P(X,R)−P(X,R̃)|≦n^{−Ω(1)} with probability 1.
Then there is a methodology P′ using random bits R′ with the corresponding properties.
The following is a convenient restatement of Hölder's inequality:
Proposition 3 (Hölder's inequality). Given a stream X of updates to at most M distinct items, F2(X) ≦ M^{1−2/p}·Fp(X)^{2/p} if p≧2, and F1(X) ≦ M^{1−1/k}·Fk(X)^{1/k} if k≧1.
1. Cascaded Frequency Moments
Let Fkp(X), for brevity, denote the cascaded frequency moment Fk∘Fp. In this section, the embodiments of the invention include a design of a 2-pass methodology for computing a (1±∈) estimate of Fkp when k≧1, p≧2 using an optimal space Õ(m^{2−2/p−2/(kp)}). The lower bound follows via a simple reduction from multiparty set disjointness. Specifically, the inputs are t=(2m)^{1/p+1/(kp)} subsets such that on a NO instance, the sets are pairwise disjoint, and on a YES instance there exists (i, j) such that the intersection of every distinct pair of sets equals {(i, j)}. The sets translate into an input X for Fkp in a standard manner. For a NO instance, fij∈{0,1} for every i, j; therefore, Fkp(X)≦Σi m^k=m^{k+1}. For a YES instance, fij=t for some i, j; therefore, Fkp(X)≧t^{kp}=(2m)^{k+1}. From the known communication complexity lower bounds for multiparty set disjointness for any constant number of passes, the space lower bound for Fkp is Ω(m²/t²)=Ω(m^{2−2/p−2/(kp)}).
1. Overview of the Methodology
The idealized version of the computer-implemented methodology is inspired by the methodology for computing Fk for k≧2. Consider the distribution on the rows of M, where the probability of choosing row i is proportional to Fp(Xi). If a row I is sampled according to this distribution, then Fp(X_I)^{k−1}, scaled by Fp(X), can be shown to be an unbiased estimator of Fkp(X). By bounding the variance, it can be shown that it suffices to sample the rows m^{1−1/k} many times to obtain a good estimate of Fkp.
The key obstacle is the sampling procedure. At the basic level, it is not feasible to compute Fp(Xi) for every i, since that would take up too much space. Instead, a subsampling technique, previously used to give space-optimal methodologies for Fp, is employed. For this, the embodiments of the invention momentarily bypass the matrix structure and view items (i, j) as belonging to a domain D of size m². The goal will be to produce a sufficiently large number of weighted samples (i, j), each weighted according to its |fij(X)|^p value, and then to use them to give an estimator for Fkp(X). The subsampling technique, however, produces an approximate histogram that is only sensitive to Fp(X) (and ignores k): items are bucketed into groups, and groups that do not have a significant overall contribution to Fp(X) are implicitly discarded by the procedure. The analysis will show that the estimator is still a good approximation to Fkp(X) in expectation. The variance causes a significant problem, since one cannot run the sampling procedure several times to produce independent samples, as that would cause a severe blow-up in space. The embodiments of the invention overcome this by scavenging enough samples from each iteration of the subsampling procedure so that the space used is optimal.
2. Producing Samples Via an Approximate Histogram for Fp.
Fix a stream X whose items belong to an arbitrary set D of size n^{O(1)}. The embodiments of the invention partition items into levels according to their weights and identify levels having a significant contribution to Fp(X).
Notation: For η≧1, we say that x approximates y within η if y≦x≦η·y.
Definition 4. Let η=(1+∈)^{Θ(1)} and B≧1 denote two parameters. Define the level sets St(X)={a∈D : |fa(X)|∈[η^{t−1}, η^t)} for 1≦t≦Cη·log n, for some constant Cη. Call a level t contributing if |St(X)|·η^{pt}≧Fp(X)/(B∂), where ∂=poly(log(n)/∈) will be fixed by the analysis below. For a contributing level t, items in St(X) will also be called contributing items.
The main result of this section is a sampling methodology geared towards contributing items. The key new ingredient is stated in Theorem 5 below.
Theorem 5. There is a one-pass procedure called SAMPLE(X, Q; B, η) using space Õ((B^{2/p}+Q^{2/p})·|D|^{1−2/p}) that outputs the following (with high probability):
1. a set G that includes all contributing levels, together with values st for t∈G such that st approximates |St(X)| within η^{p+2};
2. a quantity Φ that approximates Fp(X) within η^{2p+2};
3. Q i.i.d. samples such that, for each individual sample, the probability qa that item a is chosen approximates |fa(X)|^p/Φ within η^{2p+2} if the level of a is in G, and equals 0 otherwise.
Proof. In the proof, the dependence on X is sometimes suppressed for ease of presentation. Parts 1 and 2 essentially follow by combining subsampling with the F2 heavy-hitters methodology to identify contributing levels. The key idea that drives the methodology is that, for a contributing level, the items of that level are heavy hitters with respect to F2 at an appropriate subsampling rate, by Hölder's inequality.
Using these ideas, a methodology returns values st for all t such that st≦η·|St|, and, if t contributes, then st≧|St|. The methodology also returns an estimate F̃p with Fp≦F̃p≦η^{p+1}·Fp.
Define τ=F̃p/(B∂η^{p+1}). The embodiments of the invention put t in G iff st·η^{pt}≧τ.
Claim 6. If t is contributing, then t is in G.
Proof. By definition of contributing, |St|·η^{pt}≧Fp/(B∂), which is at least F̃p/(B∂η^{p+1}). Moreover, since st≧|St|, this implies that st·η^{pt}≧F̃p/(B∂η^{p+1})=τ, and thus t is in G.
Claim 7. If t is in G, then st≧|St|/η^{p+1}.
Proof. If t contributes, this follows by the definition of contributing. So suppose that t does not contribute, so that |St|·η^{pt}≦Fp/(B∂). Since t is in G, st·η^{pt}≧τ=F̃p/(B∂η^{p+1}), and the latter quantity is ≧Fp/(B∂η^{p+1}) since F̃p≧Fp. Hence, st·η^{pt}≧Fp/(B∂η^{p+1})≧|St|·η^{pt}/η^{p+1}, so that st≧|St|/η^{p+1}, as desired.
The embodiments of the invention rescale the st values for t∈G by multiplying them by η^{p+1}. Claims 6 and 7 now imply part 1. The space used equals Õ((B∂)^{2/p}·|D|^{1−2/p})=Õ(B^{2/p}·|D|^{1−2/p}).
For part 2, let Φ=Σt∈G st·η^{pt}. It is not hard to show that Φ approximates Fp(X) within η^{2p+2} by a bounding argument. This is because there are three sources of error: (1) the frequencies in the St are discretized into powers of η; (2) the st values approximate the |St(X)| values only within a power of η; and (3) Φ ignores St for t∉G. For (3), the embodiments of the invention need to assume that ∂ is sufficiently large.
For Part 3, fix t∈G and let αt=Q·st·η^{pt}/Φ. The quantity αt represents the expected number of samples that are needed from level t. Assume without loss of generality that Q≧η^{p+1}·B∂²·log(n); this will affect the space bound claimed in the theorem by only an Õ(1) factor. By definition of t in G, and by parts 1 and 2, the embodiments of the invention have αt≧Q/(B∂·η^{3p+3}). The embodiments of the invention will now show how to obtain a uniform set of βt=c1·min(αt, st) samples without replacement from each contributing t, where c1=Õ(1). Let j≧0 be such that st/2^j≦βt<st/2^{j−1}. The key idea is sub-sampling: let h:D→{0,1} be a random function such that h(a)=1 with probability 1/2^j and the values h(a) for all a are jointly independent. In the stream, items a such that h(a)=0 are discarded. Let Yj denote the stream of the surviving items. By Markov's inequality, the embodiments of the invention get that with high probability, (*) Fp(Yj)≦c2·Fp(X)/2^j and (**) the number of distinct items in Yj is at most c3·|D|/2^j, where c2=c3=Õ(1).
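A sketch of this sub-sampling step appears below (illustrative only: it materializes the items and uses Python's random module, whereas the streaming methodology would use a pseudorandom hash function h and never enumerate D):

```python
import random

def subsample_level(items, s_t, beta_t):
    """Keep each item independently with probability 2^-j, where j is the
    smallest value with s_t / 2^j <= beta_t, so that roughly beta_t of the
    ~s_t level-t items survive."""
    j = 0
    while s_t / (2 ** j) > beta_t:
        j += 1
    h = {a: random.random() < 0.5 ** j for a in items}   # the random function h: D -> {0,1}
    survivors = [a for a in items if h[a]]               # items with h(a)=0 are discarded
    return survivors, j
```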
Now, expanding βt=c1·min(αt, st) and applying Part 2, and then applying Hölder's inequality together with (*) and (**) above, every surviving item of St is an F2-heavy hitter of Yj up to a factor C, for some C=Õ(1), since p≧2 implies that (2^j)^{1−2/p}≧1. Thus, by running an F2-heavy-hitters methodology on Yj, the embodiments of the invention will find every sub-sampled item of St. With high probability, the number of such items will be Ω(βt), which, after rescaling βt by an Õ(1) factor, is at least c1·min(αt, st), the number of samples needed.
To finish the proof, for each iteration q=1, …, Q, a level t∈G is picked with probability st·η^{pt}/Φ. By Markov's inequality and a union bound, no level t is picked more than c1·αt times with high probability. By the argument above, the embodiments of the invention indeed have this many samples for each t, but these are samples obtained without replacement. Then, by Lemma 8 shown below, the embodiments of the invention get a uniformly chosen sample in St, independent of the other iterations. The probability that a contributing item a belonging to level t is chosen is then (st·η^{pt}/Φ)·(1/|St(X)|), which approximates |fa(X)|^p/Φ within η^{2p+2}, as claimed.
Lemma 8. If the embodiments of the invention have a sample of size t, chosen uniformly without replacement from a domain of known size, then the embodiments of the invention can obtain a sample of size t chosen uniformly with replacement.
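A proof-of-concept of Lemma 8 (an illustrative sketch): the collision pattern of t i.i.d. uniform draws from a domain of known size N depends only on N, so the pattern can be generated first and then populated with the elements of the without-replacement sample, assuming that sample is given in uniformly random order.

```python
import random

def with_replacement_from_without(sample, domain_size):
    """sample: uniform without-replacement sample (in random order)
    from a domain of known size."""
    t = len(sample)
    draws = [random.randrange(domain_size) for _ in range(t)]  # i.i.d. draw pattern
    slot_to_value, result, next_unused = {}, [], 0
    for d in draws:
        if d not in slot_to_value:                   # first time this slot is hit:
            slot_to_value[d] = sample[next_unused]   # consume a fresh sample element
            next_unused += 1
        result.append(slot_to_value[d])
    return result                                    # distributed as t draws with replacement
```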
3. Computing Fkp when k≧1, p≧2.
Recalling the setup, the embodiments of the invention are given a stream X of length n whose items belong to [m]×[m]. Let Xi denote the sub-stream of X corresponding to updates to items (i, j) for all j∈[m]. The embodiments of the invention show how to compute Fkp(X) ≜ Σi(Σj|fij(X)|^p)^k = Σi Fp(Xi)^k.
Consider the pseudo-code shown in Methodology 1, which runs in 2 passes.
Methodology 1: Compute Fkp(X).
1. Call SAMPLE(X, Q; B, η) with Q=B=m^{1−1/k} to obtain G, st for each t∈G, and Q samples.
2. Let Φ=Σt∈G st·η^{pt}.
3. For each sample (i, j), estimate Fp(Xi)^{k−1} by invoking SAMPLE(Xi, Q; B, η) with Q=B=1. Let Ψ denote the average of the estimates over all samples.
4. Output Φ·Ψ.
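A high-level Python skeleton of Methodology 1 is shown below for orientation; `sample_oracle` stands in for the SAMPLE procedure of Theorem 5 and `row_fp_estimate` for the second-pass row estimate of Step 3, both hypothetical callables here.

```python
def methodology_1(stream, m, k, p, eta, sample_oracle, row_fp_estimate):
    """sample_oracle(stream, Q, B, eta) -> (G, s, samples): the SAMPLE procedure;
    row_fp_estimate(stream, i): (1+eps)-estimate of Fp(X_i) in a second pass."""
    Q = B = m ** (1 - 1 / k)
    G, s, samples = sample_oracle(stream, Q, B, eta)         # Step 1 (pass 1)
    phi = sum(s[t] * eta ** (p * t) for t in G)              # Step 2
    psi = sum(row_fp_estimate(stream, i) ** (k - 1)          # Step 3 (pass 2)
              for (i, j) in samples) / len(samples)
    return phi * psi                                         # Step 4
```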
The embodiments of the invention will prove the correctness of Methodology 1 via the following claims. First, it is shown that, for estimating Fkp(X), the t's not in G can be eliminated.
Lemma 9. For any t∉G, |St(X)|·η^{pt} ≦ Fkp(X)^{1/k}/∂.
Proof. If t∉G, then by Theorem 5, t is not contributing. Hence, |St(X)|·η^{pt} ≦ Fp(X)/(B∂). By Hölder's inequality, for k≧1, Fp(X)=ΣiFp(Xi) ≦ m^{1−1/k}·(ΣiFp(Xi)^k)^{1/k} = m^{1−1/k}·Fkp(X)^{1/k}. Setting B=m^{1−1/k}, the embodiments of the invention obtain |St(X)|·η^{pt} ≦ Fkp(X)^{1/k}/∂.
The next lemma shows that the t's in G provide a good estimate of Fkp(X).
Lemma 10. Define the stream Y by including only the items that belong to levels t∈G in the stream X. Then Fkp(Y)≦Fkp(X)≦(1+∈)·Fkp(Y).
Proof. Let N denote the set of items that belong to levels t∉G. Since Fkp(X) is a monotonic function in terms of the various |fij(X)|'s, and deleting the items in N causes their weights to drop to 0, it follows that Fkp(Y)≦Fkp(X). The embodiments of the invention will next show that Fkp(X)≦(1+∈)·Fkp(Y). Assume w.l.o.g. that Fp(Y1)≧Fp(Y2)≧ … ≧Fp(Ym). Since the function f(x1, x2, …, xm)=Σ_{i=1}^{m} xi^k is monotone and convex, Fkp(X) is largest when all of the weight of the deleted items is concentrated on the largest coordinate, so that
Fkp(X) ≦ (Fp(Y1)+Σ_{(i,j)∈N}|fij(X)|^p)^k + Σ_{i>1}Fp(Yi)^k.  (2)
Now, Σ_{(i,j)∈N}|fij(X)|^p ≦ Σt∉G|St(X)|·η^{pt}. Substituting these bounds in (2),
Fkp(X) ≦ (Fp(Y1)+Σt∉G|St(X)|·η^{pt})^k + Σ_{i>1}Fp(Yi)^k.  (3)
Let U ≜ Fp(Y1) and V ≜ Σt∉G|St(X)|·η^{pt}.
Consider two cases. If U≧kV/∈, then (U+V)^k ≦ U^k·(1+∈/k)^k ≦ U^k·(1+O(∈)) = Fp(Y1)^k·(1+O(∈)) ≦ Fp(Y1)^k + O(∈)·Fkp(Y).
Substituting this bound in (3) proves the lemma for this case.
Otherwise, U<kV/∈. By Lemma 9, V=Σt∉G|St(X)|·η^{pt} ≦ Cη·log(n)·Fkp(X)^{1/k}/∂. Since U<kV/∈, we have U+V ≦ (1+k/∈)·V. Choose ∂=poly(log(n)/∈) to be large enough so that (U+V)^k ≦ ∈·Fkp(X). Applying this bound in (3), Fkp(X) ≦ ∈·Fkp(X)+Fkp(Y), i.e., Fkp(X) ≦ Fkp(Y)/(1−∈), which completes the proof of the lemma.
Next, Step 3 of the methodology is analyzed:
Lemma 11. The probability of choosing a given i in Step 3 approximates Fp(Yi)/Φ within η^{2p+2}.
Proof. By Theorem 5, the probability that (i, j) is chosen approximates |fij(X)|^p/Φ within η^{2p+2} provided (i, j) is in a level which is in G, and equals 0 otherwise. Summing over all such (i, j) for the various j's yields the claim.
Theorem 12. The output in Step 4 is a good estimate of Fkp(X).
Proof. By Lemma 11, the expected value of the output satisfies E[Φ·Ψ] ≈ Σi Fp(Yi)·E[Ψ|i is sampled], within a factor of η^{2p+2}.  (4)
For each i within the sum, applying Theorem 5, part 2, it is known that Ψ approximates Fp(Xi)^{k−1} within η^{(2p+2)(k−1)}. Substituting in (4), E[Φ·Ψ] approximates A ≜ Σi Fp(Yi)·Fp(Xi)^{k−1} within η^{(2p+2)k}. Observe that since Fp(Yi)≦Fp(Xi), one has Fkp(Y)≦A≦Fkp(X). Applying Lemma 10, and choosing η to be sufficiently close to 1, shows that the expected value of the estimator is a good approximation of Fkp(X). Turning to the variance:
Applying the same inequalities as above, Fp(Yi)≦Fp(Xi) and Φ≦η^{2p+2}·Fp(X), as well as Ψ≦η^{(2p+2)(2k−2)}·Fp(Xi)^{2k−2}. Therefore, E[(Φ·Ψ)²] ≦ η^{O(pk)}·Fp(X)·ΣiFp(Xi)^{2k−1}. Since, by Hölder's inequality, Fp(X)=ΣiFp(Xi)≦m^{1−1/k}·Fkp(X)^{1/k}, and since ΣiFp(Xi)^{2k−1}≦Fkp(X)^{2−1/k}, it is thus obtained that E[(Φ·Ψ)²] ≦ m^{1−1/k}·Fkp(X)² up to an Õ(1) factor, so there are just enough samples to obtain a good estimate of Fkp(X).
Referring again to the drawings, an exemplary method is illustrated in which out-of-order data associated with individual names is continuously received from a data stream using a computerized device, and normalized Euclidean norms around mean values are computed for the received data.
An average of the normalized Euclidean norms is calculated 406 for each set of data segmented according to the individual names over the data stream using the computerized device, and an average historical volatility is calculated based on the calculated average of the normalized Euclidean norms using the computerized device 408. Finally, the average historical volatility is output from the computerized device 410.
Calculating the average historical volatility may be performed while continuously receiving the out-of-order data over an indefinite period of time. The out-of-order data may be recorded using a quantity rlog, also known as a “logarithmic return on investment.” The individual names associated with the data may include stock names, for example. Computing the normalized Euclidean values around the mean values may further comprise computing a variance of the rlog values.
With its unique and novel features, one or more embodiments of the invention provide a low-storage solution with an arbitrary ordering of data by maintaining random summaries, i.e., sketches, of the dataset, where the summaries arise from specific sampling techniques of the dataset; specifically, sampling the dataset at intervals whose endpoints grow according to a particular power, e.g., a power of two (2), where the intervals would comprise 1-2, 3-4, 5-8, 9-16, 17-32, 33-64, etc. An interval's counter is incremented each time that received data falls within the specified interval. The embodiment of the invention then samples a single data point (e.g., stock name, time, value) within a single interval. A second pass over the data then computes the variance of the sampled single data point over all of the segmented data sharing the common value on which the data was segmented, e.g., a stock name.
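A simplified sketch of this two-pass summary structure follows (illustrative; a full implementation would bound memory across intervals and use hash-based sampling so that both passes see the same random choices):

```python
import math, random
from collections import defaultdict

def build_summary(stream):
    """Pass 1: bucket values into power-of-two intervals 1-2, 3-4, 5-8, ...,
    count how many received values fall in each interval, and keep one
    uniformly sampled data point per interval (reservoir of size 1)."""
    counts, sample = defaultdict(int), {}
    for record in stream:                                # record = (stock_name, time, value)
        _, _, value = record
        level = max(0, math.ceil(math.log2(value)))      # interval index for this value
        counts[level] += 1
        if random.randrange(counts[level]) == 0:         # keep with probability 1/count
            sample[level] = record
    return counts, sample

def second_pass_variance(stream, stock_name):
    """Pass 2: variance of the log values over all records sharing the
    segmentation key (e.g., the stock name) of a sampled data point."""
    vals = [math.log(v) for (s, _, v) in stream if s == stock_name]
    mean = sum(vals) / len(vals)
    return sum((x - mean) ** 2 for x in vals) / len(vals)
```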
A method is given for efficiently approximating cascaded aggregates in a data stream in a single pass over a dataset, with entries presented to the methodology in an arbitrary order.
For example, in a stock market, the changes in various stock prices are recorded continuously using a quantity rlog known as the logarithmic return on investment. The average historical volatility is computed from the data by segmenting the data according to stock name, computing the variance of the rlog values recorded for each stock (i.e., normalized Euclidean norm around the mean), and computing the average of these values over all stocks (i.e., normalized L1-norm).
Similarly, estimating the kurtosis risk in credit card fraud involves aggregating high-volume/value purchases made on individual credit card numbers. This is akin to computing the maximum norm on the transactions of individual credit cards followed by the L4-norm on the resulting values.
While previous data streaming methods address norm computation of datasets, the method here is the first to address the problem of cascaded norm computations, namely, the computation of the norm of a column of norms, one for each row in the dataset. Trivial solutions to this problem are obtained by either storing the entire database and performing an offline methodology, or assuming the data is presented in a row-by-row order. The first solution is impractical for massive datasets stored externally, which cannot even fit in RAM. The second solution requires an unrealistic assumption, i.e., that data is arriving on a network in a predictable order. The method presented here provides a low-storage solution with an arbitrary ordering of data by maintaining random summaries (e.g., sketches) of the dataset. The summaries arise from novel sampling techniques of the dataset.
As will be appreciated by one skilled in the art, an embodiment of the invention may be embodied as a system, method or computer program product. Accordingly, an embodiment of the invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a ‘circuit,’ ‘module’ or ‘system.’ Furthermore, an embodiment of the invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, a transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.
Computer program code for carrying out operations of an embodiment of the invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the ‘C’ programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
An embodiment of the invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Referring now to the drawings, a typical hardware configuration of an information handling/computer system in accordance with the embodiments of the invention is shown, preferably having at least one processor or central processing unit (CPU) 710.
In addition to the system described above, a different aspect of the invention includes a computer-implemented method for performing the above method. As an example, this method may be implemented in the particular environment discussed above.
Such a method may be implemented, for example, by operating a computer, as embodied by a digital data processing apparatus, to execute a sequence of machine-readable instructions. These instructions may reside in various types of signal-bearing media.
Thus, this aspect of the present invention is directed to a programmed product, including signal-bearing media tangibly embodying a program of machine-readable instructions executable by a digital data processor to perform the above method.
Such a method may be implemented, for example, by operating the CPU 710 to execute a sequence of machine-readable instructions. These instructions may reside in various types of signal bearing media.
Thus, this aspect of the present invention is directed to a programmed product, comprising signal-bearing media tangibly embodying a program of machine-readable instructions executable by a digital data processor incorporating the CPU 710 and hardware above, to perform the method of the invention.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of any embodiments of the invention. As used herein, the singular forms ‘a’, ‘an’ and ‘the’ are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms ‘comprises’ and/or ‘comprising,’ when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the embodiments of the invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the embodiments of the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the embodiments of the invention. The embodiment was chosen and described in order to best explain the principles of the embodiments of the invention and the practical application, and to enable others of ordinary skill in the art to understand the embodiments of the invention for various embodiments with various modifications as are suited to the particular use contemplated.