1. Field of the Invention
The present invention relates to traffic analysis in a network.
2. Description of the Related Art
A database is a collection of information. Relational databases are typically illustrated as one or more two-dimensional tables. Each table arranges the information in rows and columns, with each row corresponding to a record and each column corresponding to a field. In a relational database, a collection of tables can be related or joined to each other through a common field or key, which enables information in one table to be automatically cross-referenced to corresponding information in another table.
A complex search may be performed on a database with a query. A query specifies a set of criteria (e.g., the quantity of parts from a particular transaction) to define identified information for a database program to retrieve from the database. An aggregate query is a query that requests information concerning a selected group of records. For example, in a database which stores sales transactions, an aggregate query may request the total quantity of an item in a particular transaction. Each aggregate query may include a set of criteria to select records (e.g., grouping of records by an item code field and a transaction code field), and an operation to perform on the group of selected records (e.g., summing the quantity fields). Typical operations for aggregate queries include counting, summing, averaging, and finding minimum and maximum values.
To perform an aggregate query, a conventional database program examines every record in the database to determine whether or not the record matches any criteria and constructs a query table from the records that match the criteria. Then the program performs the required operation over the appropriate fields from each record in the query table.
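As a minimal illustration of the scan-then-aggregate behavior described above, the following Python sketch groups sales records by item code and transaction code and sums the quantity field. The record layout and the names `aggregate_sum`, `item`, `txn`, and `qty` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def aggregate_sum(records, group_fields, value_field, predicate=lambda r: True):
    """Examine every record, keep those matching the criteria,
    and sum value_field within each group."""
    totals = defaultdict(int)
    for rec in records:                      # full scan of the table
        if predicate(rec):                   # selection criteria
            key = tuple(rec[f] for f in group_fields)
            totals[key] += rec[value_field]  # aggregate operation (sum)
    return dict(totals)


# Hypothetical sales-transaction table.
sales = [
    {"item": "A", "txn": 1, "qty": 2},
    {"item": "A", "txn": 1, "qty": 3},
    {"item": "B", "txn": 1, "qty": 1},
    {"item": "A", "txn": 2, "qty": 4},
]
totals = aggregate_sum(sales, ("item", "txn"), "qty")
```

The cost of this approach grows with the number of records examined, which is the limitation the remainder of the description addresses.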
Massive data streams are increasingly prevalent in many real-time applications, such as web applications, Internet-traffic monitoring, telecommunication-data management, financial applications, and sensor networks. Often, the data streams in these applications are distributed across many locations, and it is important to be able to answer aggregate queries that pool information from multiple locations. Given continuous data feeds to support real-time decision making in mission-critical applications, such as fraud and anomaly detection, these queries are typically evaluated continuously, in an online fashion. For example, in a high-speed network with many nodes, packet streams arrive at and depart from the nodes on a continuous basis. A quantity that is of importance for many network-management applications, such as optimization and fault management, is a traffic matrix, which is a representation of the volume of traffic (typically in packets or bytes) that flows between origin-destination (OD) node pairs in a communication network during a measurement interval. A traffic matrix varies over time, and a sudden change may indicate an underlying anomaly.
In some circumstances, such as the monitoring of network traffic that includes high-speed and/or high-volume data streams, aggregate querying, as performed by conventional database programs, may be unacceptably slow. In such circumstances, exact computation for aggregate queries can be difficult to carry out, due to large memory requirements.
The term “set expression” refers to an expression that defines a set of data elements and is made up of set identifiers (i.e., names of sets) and set operations (such as complements, unions, intersections, and differences) performed on those sets. Each data element may be, e.g., an individual byte of data or a record containing multiple bytes of data. The terms “stream expression” and “data stream,” as used herein, refer to a set expression defined over multiple streams (such as streams of data passing through different nodes of a network), where each stream is considered as a set of elements. Since, in a given stream expression, elements may appear more than once, the term “stream-expression cardinality” refers to the number of distinct elements in a stream expression.
For example, in the Venn diagram of
In one embodiment, the present invention provides a method of monitoring a network. The method includes, at each node of a set, constructing a corresponding vector of M components based on a stream of data packets received at the node during a time period, the set including a plurality of nodes of the network, M being greater than 1; and estimating a value of a byte traffic produced by a part of the packets based on the constructed vectors, the part being the packets received by every node of the set. The constructing includes updating a component of the vector corresponding to one of the nodes in response to the one of the nodes receiving a data packet. The updating includes selecting a component of the vector to be updated by hashing a property of the received data packet.
Embodiments of the present invention will now be discussed in further detail in the following sequence.
First, a Quasi-Maximum Likelihood Estimation (QMLE) estimator of the aggregate query will be proposed, which is near-optimal in terms of statistical efficiency, without requiring any prior knowledge of the actual distribution of the attribute values to be aggregated. Such a QMLE estimator constructs and employs, for each of the two data streams, vectors of M components, where M>1. The vectors are compact representations of the actual elements in the streams. The near-optimality implies that algorithms consistent with embodiments of the invention can yield highly accurate estimates given a small amount of memory. A QMLE estimator is also scale-free, in the sense that the approximation error of the estimator is independent of unknown data-stream volumes.
Second, a new vector-generating algorithm for approximately answering aggregate queries over two data streams will be presented.
Theoretical analysis has shown that, with the same memory requirement, this approach has superior performance to those of the prior art, and the relative error of a QMLE estimator scales linearly with the square root of the noise-to-signal ratios, while prior-art approaches scale linearly with noise-to-signal ratios.
An embodiment of a QMLE scheme is a traffic-matrix estimation problem in a high-speed network, where the objective is to accurately estimate total traffic volume (e.g., in bytes) between origin and destination nodes in the network, using byte streams observed at individual nodes. This embodiment may be used in the system of
For a pair of high-volume data streams, each record in a stream is an identifier-value pair (i, ν), where each identifier i is unique within its data stream and ν has a finite variance. In practice, such a constraint on (i, ν) is satisfied or approximately satisfied in many situations. For example, in the traffic-matrix estimation problem, duplicate packets constitute a very small percentage of the total traffic (typically less than 2%), and packet sizes are bounded, usually between 1 and 1500 bytes.
Embodiments of the present invention provide schemes for providing approximate answers to aggregate queries (e.g., sum queries, count queries, and average queries) of ν over the pair of data streams in a time interval, using the identifier i as the equi-join attribute (a value for i that is compared based only on whether or not it is equal to another value i from a different record). Such schemes are based on vectors, each vector being a compact synopsis that can be generated, for each data stream, with little processing overhead. If a sum query is being answered, and ν is always positive, then such schemes can be generalized to other situations.
Each vector is a hash array of size M generated as follows. For each incoming record (i, ν) in the stream, the record is first hashed to a bucket using its identifier i as a hash key, and then a value g(i) is computed, where g(•) is a unit-exponential random-number generator using i as its seed value. Each bucket then stores the minimum value of g(i)/ν for all records hashed to the bucket. At the end of each measurement interval, the vectors for both data streams are routed to a centralized location, where an accurate estimate of the sum query is obtained using a QMLE method, as will be described in further detail below. Such algorithms are scale-free in the sense that, for a given level of approximation accuracy, M is independent of unknown data-stream volumes.
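The vector-generation step described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the claimed implementation: the use of `hashlib.blake2b` for both the bucket hash h and the seeded unit-exponential generator g, the salt strings, and the function names are choices made only for this example.

```python
import hashlib
import math


def _hash64(key, salt):
    """Deterministic 64-bit hash of (salt, key); blake2b is an arbitrary choice."""
    digest = hashlib.blake2b(f"{salt}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def h(i, M):
    """Bucket index in [0, M) derived from identifier i."""
    return _hash64(i, "bucket") % M


def g(i):
    """Pseudo-random unit-exponential value seeded by identifier i
    (inverse-transform of a hash-derived uniform in (0, 1])."""
    u = (_hash64(i, "exp") + 1) / 2**64
    return -math.log(u)


def build_sketch(stream, M):
    """Each bucket keeps the minimum of g(i)/v over the records hashed to it;
    untouched buckets stay at infinity."""
    Y = [math.inf] * M
    for i, v in stream:
        if v <= 0:
            raise ValueError("attribute values are assumed positive")
        k = h(i, M)
        Y[k] = min(Y[k], g(i) / v)
    return Y


# Hypothetical (identifier, byte-count) records observed at one node.
packets = [("pkt-1", 1500), ("pkt-2", 40), ("pkt-3", 576), ("pkt-4", 1500)]
sketch = build_sketch(packets, M=8)
```

Because h and g depend only on the identifier, two nodes that observe the same record compute the same bucket and the same g(i), which is what allows the vectors to be compared at the centralized location.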
A feature of certain embodiments of the present invention is the development of a likelihood-based inference, based on a new statistical model of vectors. As a result, estimates are generated that are highly efficient and scale well with high noise-to-signal ratios. Furthermore, an accurate approximation of the distribution of relative-estimation error can be derived, which provides a much more informative characterization of the error distribution than loose probability bounds.
While other solutions for count-query computation over data streams have previously been proposed based on hashing and extreme values of a randomization function applied to each data stream, in embodiments of the present invention, the distribution of ν is unknown in advance, and hence, the distribution of the hash-array values for the vector-generating algorithm is not exactly known. QMLE methods consistent with embodiments of the present invention account for this uncertainty and yield near-optimal estimates in terms of statistical efficiency.
The following additional notation will be used herein. The expression P(•) represents a probability function. The expressions E(•) and var(•) represent the expectation and variance of a random variable, respectively. The expressions corr(•,•) and cov(•,•) represent correlation and covariance, respectively, between two random variables, and the expression $\xrightarrow{d}$ represents convergence in distribution. The expression ≜ means a definition, and the expression a≈b is equivalent to a/b≈1, where the operator ≈ represents an approximation of equality. The operators ∪, ∩, and \ represent set union, set intersection, and set difference, respectively.
An introduction to the problem, including a streaming algorithm for vector generation of an individual stream, will now be provided.
The expressions τ1, τ2 represent two data streams, where each element is composed of an identifier and value pair (i, ν). Assuming that there is no duplicate identifier i in each of the data streams in a given time interval, and that the attribute value ν has a finite variance (the finite variance assumption is satisfied, e.g., if ν is bounded), approximate answers are sought for:
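$$V = \sum_{(i,\nu)\in\tau_1\cap\tau_2} \nu \quad \text{(sum query)},$$
$$C = \sum_{(i,\nu)\in\tau_1\cap\tau_2} 1 \quad \text{(count query)},$$
$$A = V/C \quad \text{(average query)}. \quad (1)$$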
Here, τ1∩τ2 denotes the intersection of τ1 and τ2. In exemplary database language, the expressions V, C, A represent the result of aggregate queries for attribute value ν of the equi-join of two data streams τ1, τ2 using identifier i.
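For small inputs, the three quantities can be computed exactly, which gives a baseline against which any approximate answer could be checked. The Python sketch below assumes each stream is a list of (identifier, value) pairs with unique identifiers per stream, and that a record appearing in both streams carries the same value in each; the function name and sample data are hypothetical.

```python
def exact_join_aggregates(stream1, stream2):
    """Exact sum (V), count (C), and average (A) over the equi-join
    of two streams on the identifier."""
    ids2 = {i for i, _ in stream2}
    common_values = [v for i, v in stream1 if i in ids2]
    V = sum(common_values)
    C = len(common_values)
    A = V / C if C else float("nan")
    return V, C, A


# Hypothetical packet records (identifier, size in bytes) at two nodes.
node1 = [("p1", 100), ("p2", 40), ("p3", 1500)]
node2 = [("p2", 40), ("p3", 1500), ("p4", 60)]
V, C, A = exact_join_aggregates(node1, node2)
```

This exact computation keeps every identifier of one stream in memory, which is what becomes impractical at the stream volumes considered next.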
Of particular interest is a scenario in which τ1, τ2 are of very high volume, e.g., containing millions or even billions of records. A highly accurate and scale-free estimate is desirable, where “scale-free” implies that the approximation errors of the underlying algorithm are independent of unknown volumes. A vector-generating algorithm consistent with embodiments of the invention is applied to each data stream to achieve this goal. A vector-generating algorithm designed to answer the sum queries will first be presented, and then the cases of the count query and average query will be discussed.
Without loss of generality, it is assumed that attribute value ν is always positive (zero values of ν can be ignored, since such values do not contribute to the sum). If one or more ν values are not positive, then the values of ν can always be divided into two groups, one for the positive values of ν and one for the negative values of ν, and a single sum query is then converted into two sum queries, one for the positive values of ν and the other for the negative values of ν.
As shown in the flowchart of
In Algorithm 1, steps 1, 2, 3, and 4 correspond to steps 410, 450, 440, and 460 of
The expression N represents the total number of records in stream τ, and λ=N/M represents the average number of records in each bucket. For the kth bucket, 1≦k≦M, if there are Bk records hashed into the bucket, i.e., (ik,l,νk,l), l=1, . . . , Bk, and, for each such record, Rk,l=g(ik,l), then

$$Y[k] = \min_{1 \le l \le B_k} \frac{R_{k,l}}{\nu_{k,l}}. \quad (2)$$

If

$$V_k = \sum_{l=1}^{B_k} \nu_{k,l} \quad (3)$$

is the total sum of attribute values in bucket k, and it is assumed that attribute value ν is always positive, then it can be seen that, when Bk≧1,

$$\min_{1 \le l \le B_k} \frac{R_{k,l}}{\nu_{k,l}} = \frac{R_k}{V_k}$$

for some Rk, which is a unit-exponential random variable. Therefore, the following equation for Y[k] can be derived:

$$Y[k] = \frac{R_k}{V_k}, \quad (4)$$

where Rk is a unit exponential, and Vk follows Equation (3) shown above.
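The reduction of the per-record minimum to a single ratio rests on a standard property of independent exponential variables. A short derivation follows, assuming the values R_{k,l} behave as independent unit exponentials and writing V_k for the total of the attribute values in bucket k:

```latex
\begin{align*}
P\Bigl(\min_{1\le l\le B_k} \tfrac{R_{k,l}}{\nu_{k,l}} > y \,\Big|\, \nu_{k,1},\dots,\nu_{k,B_k}\Bigr)
  &= \prod_{l=1}^{B_k} P\bigl(R_{k,l} > y\,\nu_{k,l}\bigr) \\
  &= \prod_{l=1}^{B_k} e^{-y\,\nu_{k,l}}
   = e^{-y V_k},
  \qquad V_k = \sum_{l=1}^{B_k} \nu_{k,l}.
\end{align*}
```

Conditional on the attribute values, the minimum is therefore exponential with rate V_k, which is the distribution of R_k/V_k for a unit-exponential R_k.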
Since the attribute values are not stored in hash array Y[k], 1≦k≦M, the exact distribution of Y[k] is unknown. However, an approximate distribution of Y[k] when λ is large can be obtained as follows. If ν is a random variable generated from an unknown distribution F with mean ν and variance κν², where κ is the ratio of the variance to the squared mean, i.e., the square of the coefficient of variation, then it can be seen that Bk follows a binomial distribution Binomial(N, 1/M), which is approximately a Poisson distribution Poisson(λ) for large values of N, M. It can be shown that

$$E[V_k] = \lambda\nu, \qquad \mathrm{Var}[V_k] = \lambda(1+\kappa)\nu^2, \quad (5)$$

and it can be further verified that, as λ→∞, almost surely,

$$\frac{V_k}{\lambda\nu} \to 1,$$

and in distribution,

$$\lambda\nu\, Y[k] \xrightarrow{d} \mathrm{Exp}(1). \quad (6)$$
Therefore, Y[k] approximates an exponential distribution with rate λν. It is further noted that the values of Rk, k=1, . . . , M, are independent, and the values of Bk, k=1, . . . , M, are approximately independent when N, M are large, and hence, the values of Y[k], 1≦k≦M, are approximately independent. The following Lemma 1 states a mathematical relationship characterizing the statistical properties of Y[k]:
$$V_k \approx \mathrm{Gamma}(\alpha, \beta), \quad (7)$$

where α is a shape parameter and β is a scale parameter of the Gamma distribution. A sum of independent random variables Vk can be approximated using a Gamma distribution with a large shape parameter, as well as a Normal distribution. However, the Gamma distribution is supported only on positive values, whereas the Normal distribution is not, and therefore, a Gamma approximation is more desirable in practice. By equating the first and second moments of the Gamma distribution, αβ and αβ², with the mean λν and variance λ(1+κ)ν² of Vk, it can be seen that the shape and scale parameters (α,β) of the Gamma distribution approximation are

$$\alpha = \frac{\lambda}{1+\kappa}, \qquad \beta = (1+\kappa)\nu. \quad (8)$$
Using the traffic-matrix estimation problem as an example in simulations, it has been shown that the Gamma distribution provides a good approximation for λ, even when λ is as small as 5.
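The moment matching behind the Gamma approximation can be checked numerically. The Python sketch below assumes the compound-Poisson moments E[Vk]=λν and Var[Vk]=λ(1+κ)ν²; the function name `gamma_params` and the sample values of λ, ν, and κ are hypothetical.

```python
def gamma_params(lam, mean, kappa):
    """Shape and scale of the Gamma distribution whose first two moments
    match lam*mean and lam*(1+kappa)*mean**2 (assumed moments of V_k)."""
    alpha = lam / (1.0 + kappa)    # shape
    beta = (1.0 + kappa) * mean    # scale
    return alpha, beta


# Hypothetical values: 5 records per bucket on average, mean size 600 bytes.
lam, mean, kappa = 5.0, 600.0, 0.8
alpha, beta = gamma_params(lam, mean, kappa)

gamma_mean = alpha * beta          # mean of Gamma(shape, scale)
gamma_var = alpha * beta ** 2      # variance of Gamma(shape, scale)
```

Note that the scale β does not depend on λ in this parameterization, so streams with different record counts but the same attribute distribution share the same β.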
It might be assumed that Vk follows a Gamma distribution with shape α and scale β. However, since this assumption might not be exactly true, the variables PQ and EQ will be used to denote the probability and expectation, respectively, based on this assumed distribution for Vk, which will be referred to as a “quasi-likelihood” method. Then, by using the characteristic function of the Gamma distribution, the following equation holds true:
$$P_Q(Y[k] \ge y) = E_Q\left[e^{-yV_k}\right] = (1+\beta y)^{-\alpha}.$$
In other words, Y[k] follows a generalized-Pareto distribution Pareto(α,β).
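The tail formula (1+βy)^(−α) for the ratio of a unit exponential to a Gamma(α, β) variable can be checked by simulation. The Python sketch below is a Monte Carlo sanity check under the Gamma working model, not part of the described method; the parameter values, sample size, and function names are arbitrary choices for illustration.

```python
import random


def pareto_survival(y, alpha, beta):
    """P(Y >= y) for the generalized-Pareto tail (1 + beta*y)^(-alpha)."""
    return (1.0 + beta * y) ** (-alpha)


def simulate_survival(y, alpha, beta, n=100_000, seed=0):
    """Empirical P(R/V >= y) with R ~ Exp(1) and V ~ Gamma(shape=alpha, scale=beta)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        r = rng.expovariate(1.0)
        v = rng.gammavariate(alpha, beta)   # second argument is the scale
        if r / v >= y:
            hits += 1
    return hits / n


alpha, beta, y = 3.0, 2.0, 0.1
exact = pareto_survival(y, alpha, beta)
approx = simulate_survival(y, alpha, beta)
```

The two values should agree up to sampling noise, since P(R ≥ yV) = E[exp(−yV)] for a unit-exponential R that is independent of V.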
The cases of the count query and average query will now be discussed. A count query returns a cardinality, e.g., the number of records (or, in other embodiments, bytes) in a data stream. To provide an approximate answer for the count query in Equation (1), a hash array can be designed for each stream, in a manner similar to Algorithm 1, by replacing ν with a constant value of 1. This has been proposed in the prior art, wherein a Maximum-Likelihood Estimate (MLE) is derived, e.g., as described in Chen et al., “A simple and efficient estimation method for stream expression cardinalities,” Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), 2007, which is incorporated herein by reference in its entirety. As shown in the flowchart of
It is noted that Algorithm 2 is substantially similar to Algorithm 1 discussed above, except that an update of Y[h(i)] is performed using Y[h(i)]←min(Y[h(i)],g(i)), instead of Y[h(i)]←min(Y[h(i)],g(i)/ν). In the scenario of Algorithm 2, the distribution of Y[k] is a truncated exponential distribution with rate λ, which also implies that Y[k] is approximately exponential with rate λ when λ is large. Thus, a new Quasi-Maximum Likelihood Estimation (QMLE) method, which will be described below, can also be applied for the count query, and its performance will be similar to that of an MLE method.
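The count variant can be sketched in Python by dropping the division by ν. The plug-in estimate shown afterwards is only a simple illustration for a single stream: it treats the bucket values as independent exponentials with a common rate λ and uses the textbook exponential-rate estimate (number of samples divided by their sum). It is not the MLE of the cited reference nor the QMLE described herein, and the hash construction and names are assumptions.

```python
import hashlib
import math


def _hash64(key, salt):
    digest = hashlib.blake2b(f"{salt}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def h(i, M):
    return _hash64(i, "bucket") % M


def g(i):
    # Hash-seeded unit-exponential value for identifier i.
    return -math.log((_hash64(i, "exp") + 1) / 2**64)


def build_count_sketch(ids, M):
    """Same update as the sum variant, with the attribute value fixed at 1."""
    Y = [math.inf] * M
    for i in ids:
        k = h(i, M)
        Y[k] = min(Y[k], g(i))
    return Y


def estimate_count(Y):
    """Illustrative plug-in estimate: rate ~ (#finite buckets) / sum, count ~ M * rate."""
    finite = [y for y in Y if y != math.inf]
    if not finite:
        return 0.0
    lam_hat = len(finite) / sum(finite)
    return len(Y) * lam_hat


N, M = 20_000, 400
sketch = build_count_sketch((f"rec-{n}" for n in range(N)), M)
estimate = estimate_count(sketch)
```

With about 50 records per bucket, the estimate is expected to land within several percent of N; the accuracy is governed by M rather than by N, in line with the scale-free property noted above.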
To answer the average query, two hash arrays are used for each data stream, one for computing the sum query, and the other for computing the count query. The result of the average query is simply the division of the sum query by the count query. Accurately estimating the sum aggregation will be discussed in further detail below.
The details of a Quasi-Maximum Likelihood Estimation (QMLE) method for aggregation will now be discussed. Arrays (Y1[k], Y2[k]), k=1, . . . , M, are a pair of hash arrays that store the pseudo-random vectors applied to streams τ1 and τ2, respectively, using Algorithm 1 with the same functions h and g. The variables Λ1, Λ2, and μ represent the mean of the total attribute sums for records in streams τ1, τ2, and τ1∩τ2, respectively, which are hashed to an arbitrary bucket. It is noted that Λ1 and Λ2 can be treated as known quantities using the total sums of attribute values in the two streams divided by M. It is further noted that Mμ is the answer to the aggregate query in Equation (1). A near-optimal statistical method for estimating μ, using (Y1[k], Y2[k]), k=1, . . . , M, will be presented below.
For simplicity, it is assumed that the attribute values in both streams τ1, τ2 are generated from the same unknown distribution F, with mean ν and variance κν². In simulation studies, it has been shown that such a simplification does not substantially alter the quality of the estimates obtained from a QMLE method.
It is noted that, in general, attribute distribution F is unknown. In statistical terms, F is a nuisance parameter for estimating μ, the parameter of interest. Since F is unknown, the exact distribution of (Y1[k], Y2[k]) is also unknown. Therefore, the usual MLE method, which is well-known to be most efficient in the statistics literature, cannot be used. However, by using a Gamma approximation, as in Equation (7), a Quasi-Maximum Likelihood Estimation (QMLE) method for estimating μ can yield a near-optimal estimate in terms of statistical efficiency.
The following additional notation will be used in the explanation below. In generating (Y1[k], Y2[k]), k=1, . . . , M, the expressions X[k], X1[k], X2[k], k=1, . . . , M, represent vectors corresponding to streams τ1∩τ2, τ1\τ2, and τ2\τ1, respectively. Expressions X[k], X1[k], X2[k], k=1, . . . , M, are also obtained using Algorithm 1 but are clearly unobservable. It can be seen that
$$Y_i[k] = \min(X[k], X_i[k]), \quad i = 1, 2.$$
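The min-relation between the observable and unobservable vectors can be verified directly on a small example. The Python sketch below builds vectors for the common records and for the records seen only at the first node, then checks that the vector of the full first stream equals their bucket-wise minimum. The hash construction, function names, and sample records are assumptions made for illustration.

```python
import hashlib
import math


def _hash64(key, salt):
    digest = hashlib.blake2b(f"{salt}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(stream, M):
    """Bucket-wise minimum of g(i)/v, with h and g both derived from identifier i."""
    Y = [math.inf] * M
    for i, v in stream:
        k = _hash64(i, "bucket") % M
        g_i = -math.log((_hash64(i, "exp") + 1) / 2**64)
        Y[k] = min(Y[k], g_i / v)
    return Y


M = 4
common = [("p1", 100), ("p2", 40), ("p3", 1500)]   # records in both streams
only_1 = [("p4", 60), ("p5", 576)]                 # records only in stream 1

X = build_sketch(common, M)            # unobservable in practice
X1 = build_sketch(only_1, M)           # unobservable in practice
Y1 = build_sketch(common + only_1, M)  # what node 1 actually computes

merged = [min(a, b) for a, b in zip(X, X1)]
```

The equality is exact rather than approximate, because a minimum over a union of record sets is the minimum of the per-set minima, provided the same functions h and g are used throughout.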
The variables λ, λ1, λ2 represent the average number of records hashed into each bucket that belong to streams τ1∩τ2, τ1\τ2, and τ2\τ1, respectively. As described above, each of X[k], X1[k], X2[k] can be approximated by a generalized-Pareto distribution when λ, λ1, λ2 are large. Therefore, to estimate μ, the likelihood function of (Y1[k], Y2[k]) can be derived by treating the generalized-Pareto distributions of X[k], X1[k], X2[k] as true models, i.e.,
X≈Pareto(α,β), Xi≈Pareto(αi,β). (9)
Then, μ is derived using the usual MLE method. Since the assumed models in Equation (9) might not be true, such a method is referred to in the statistics literature as a “Quasi-” Maximum-Likelihood Estimation (QMLE) method. Typically, there will be a bias for a QMLE estimate, which bias will be shown below to be negligible for large values of λ+λ1+λ2, and a QMLE estimate will be shown to be near-optimal in terms of statistical efficiency.
The variable PQ denotes the quasi-probability based on this Pareto-distribution assumption. The quasi-likelihood can be derived as follows. For y1≧0, y2≧0, the following equations are true:
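A sketch of the joint tail probability implied by the relation Y_i = min(X, X_i) is given below. It assumes that X, X_1, and X_2 are mutually independent and that each has the generalized-Pareto tail of Equation (9), i.e., P_Q(X ≥ x) = (1+βx)^(−α) and P_Q(X_i ≥ x) = (1+βx)^(−α_i):

```latex
\begin{align*}
P_Q(Y_1 \ge y_1,\; Y_2 \ge y_2)
  &= P_Q\bigl(X \ge \max(y_1, y_2)\bigr)\, P_Q(X_1 \ge y_1)\, P_Q(X_2 \ge y_2) \\
  &= \bigl(1 + \beta \max(y_1, y_2)\bigr)^{-\alpha}
     (1 + \beta y_1)^{-\alpha_1}
     (1 + \beta y_2)^{-\alpha_2}.
\end{align*}
```

Setting y_2 = 0 gives the marginal tail (1+βy_1)^(−(α+α_1)) for Y_1, so that each observable vector is itself of generalized-Pareto form with shape α+α_i under these assumptions.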
Thus, a quasi-density function, PQ(Y1=y1, Y2=y2), can be written as
The variable mi represents the proportion of Case i, i=1, 2, 3, among the M obtained bucket pairs, (Y1[k], Y2[k]), k=1, . . . , M. It can be observed that, for some bucket pairs, one or even both of the values may be ∞ if no records are hashed into one or both buckets in the pair. The expression Yi* is defined as Yi*=Yi·I(Yi<∞), i=1, 2. From Equation (8), the relations μ=αβ and Λi=(α+αi)β, i=1, 2, can be derived. By using this transformation, the quasi-likelihood function can be expressed as a function of β, μ, Λ1, Λ2, where Λ1, Λ2 can be treated as known quantities, as explained previously. The following Statement 1 provides the average quasi-log likelihood of the pair of hash arrays, (Y1[k], Y2[k]), k=1, . . . , M.
Statement 1:
Therefore, as shown in
Thus, a new Quasi-Maximum Likelihood Estimation (QMLE) method has been developed in response to an increased focus on probabilistic-query algorithms for large-volume data streams. Embodiments of the present invention further address the problem of aggregate queries, such as sum, count, and average, and provide new algorithms based on minimum statistics and quasi-likelihood inference. A QMLE method consistent with certain embodiments of the invention is near-optimal in terms of statistical efficiency. Both theoretical analysis and empirical studies have shown that, with the same memory requirement, a QMLE method has significantly superior performance relative to existing perturbation methods, particularly when noise-to-signal ratios are large. A method for monitoring bytes among a pair of network nodes by using a QMLE estimator, e.g., for the purpose of detecting network-traffic anomalies, has also been presented.
It has been demonstrated empirically that, for a pair of high-volume data streams, estimation algorithms consistent with certain embodiments of the present invention yield more accurate estimates of the aggregate queries than existing approaches, using the same amount of memory.
The terms “Quasi-Maximum Likelihood Estimation” and “QMLE” should be understood to include the particular implementations and embodiments explicitly described herein, as well as other possible implementations and embodiments.
While embodiments of the invention disclosed herein use certain methods for generating statistical vectors, as shown in
The present invention has applicability in monitoring traffic in different environments and comprising data streams of different types, including not only traditional-network (e.g., hardwired LAN) data streams, but also, e.g., wireless-network data streams, sensor-network data streams, and financial-application data streams. A QMLE scheme consistent with certain embodiments of the invention can be used to estimate both the number of bytes and the number of packets in stream expressions over a pair of individual nodes in a network.
The term “random,” as used herein, should not be construed as being limited to pure random selections or number generations, but should be understood to include pseudo-random, including seed-based selections or number generations, as well as other selection or number generation methods that might simulate randomness but are not actually random, or do not even attempt to simulate randomness.
The present invention may be implemented as circuit-based processes, including possible implementation as a single integrated circuit (such as an ASIC or an FPGA), a multi-chip module, a single card, or a multi-card circuit pack. As would be apparent to one skilled in the art, various functions of circuit elements may also be implemented as processing blocks in a software program. Such software may be employed in, for example, a digital signal processor, micro-controller, or general-purpose computer.
The present invention can be embodied in the form of methods and apparatuses for practicing those methods. The present invention can also be embodied in the form of data-storage media, such as magnetic recording media, optical recording media, solid state memory, floppy diskettes, CD-ROMs, hard drives, or any other machine-readable data-storage medium storing machine-readable program code, wherein the program code includes a set of instructions for executing one of the inventive methods on a digital data-processing machine, such as a computer, to perform the method. The present invention can also be embodied in the form of program code, for example, whether stored in a storage medium, loaded into and/or executed by a machine, or transmitted over some transmission medium or carrier, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits. Unless explicitly stated otherwise, each numerical value and range should be interpreted as being approximate as if the word “about” or “approximately” preceded the value of the value or range.
It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain the nature of this invention may be made by those skilled in the art without departing from the scope of the invention as expressed in the following claims.
It should be understood that the steps of the exemplary methods set forth herein are not necessarily required to be performed in the order described, and the order of the steps of such methods should be understood to be merely exemplary. Likewise, additional steps may be included in such methods, and certain steps may be omitted or combined, in methods consistent with various embodiments of the present invention.
Although the elements in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments. The same applies to the term “implementation.”
This application is related to U.S. application Ser. No. ______, filed on the same date as this application as attorney docket no. Bu 12-6-6, the teachings of which are incorporated herein by reference.