1. Field of the Invention
The present invention relates to traffic analysis in a network.
2. Description of the Related Art
Massive and distributed data streams are increasingly prevalent in many modern applications. In a backbone Internet-Protocol (IP) network composed of hundreds or even thousands of nodes, packets arrive at and depart from the nodes at very high speeds. In a web content-delivery system composed of many servers (such as Akamai), user requests for accessing websites are distributed among the many servers based on the location of the user and current server loads. Other application domains that give rise to these massive and distributed streams include financial applications and sensor networks.
Due to their massive and distributed nature, answering queries about these data streams poses a unique challenge. Often, exact-query answering is infeasible due to memory requirements and communications overhead. In this scenario, approximate-query answering, which can provide probabilistic guarantees, becomes the only viable option. One of the most fundamental query classes of interest is the estimation of the number of flows in such streams.
As a first example, in the context of IP-network management, the number of distinct flows in a network sharing the same characteristics is of high interest to network operators, where a packet flow is defined as, e.g., a sequence of packets that have the same 5-tuple (a logical construct containing five parameters used to identify the connection and allowing network packets of data to be communicated between a server process and a client process in a bi-directional fashion), i.e., the same IP addresses/ports of the two communicating peers and the same protocol. The flow ID of a packet can be derived from the 5-tuple. The number of distinct flows between a node pair x and y, which is a type of traffic matrix element, can be formulated as the cardinality |Tx∩Ty|, where Tx and Ty are the streams of packet-flow IDs seen at nodes x and y, respectively. Such traffic matrix elements can be used by network operators for network provisioning and optimization.
A second example is the total number of distinct flows to the same destination node i, i.e., |∪j Tj|, where Tj are the streams of packet-flow IDs to node i. A significant increase in |∪j Tj| may indicate an underlying network anomaly, such as a Denial of Service (DoS) attack.
The term “set expression” refers to an expression that defines a set of data elements and is made up of set identifiers (i.e., names of sets) and set operations (such as complements, unions, intersections, and differences) performed on those sets. The term “stream expression” refers to a set expression defined over multiple streams (such as streams of data passing through different nodes of a network), where each stream is considered as a set of elements. Since, in a given stream expression, elements may appear more than once, the term “stream-expression cardinality” refers to the number of distinct elements in a stream expression. For example, in the Venn diagram of
In one embodiment, the present invention provides a method of monitoring a network. The method includes, at each node of a fixed set, constructing a corresponding vector of M components based on data packets received at the node during a time period, M being an integer greater than 1, the fixed set being formed of some nodes of the network; and, based on the constructed vectors, estimating how many of the received data packets have been received by all of the nodes of the set or estimating how many flows of the received data packets have data packets that have passed through all of the nodes of the set. The constructing includes updating a component of the vector of one of the nodes in response to the one of the nodes receiving a data packet. The updating includes selecting the component for updating by hashing a property of the data packet received by the one of the nodes.
Estimation methods for stream-expression cardinalities, consistent with embodiments of the present invention, will now be discussed in further detail in the following sequence. First, a method for monitoring flows or packets among multiple network nodes by using a Proportional-Union Estimation (PUE) estimator consistent with one embodiment of the invention will be described. Second, some basic statistical concepts and a continuous variant of Flajolet-Martin (FM) vectors on which a PUE method is based will be discussed. Third, a PUE estimator will be developed from the continuous FM vectors, and its performance will be characterized analytically. Fourth, the statistical efficiency of a PUE estimator will be analyzed by comparison with the MLE for expressions over two streams.
An embodiment of a PUE method is a traffic-matrix estimation problem in a high-speed network, where the objective is to accurately estimate total traffic volume (in flows or packets) between origin and destination nodes in the network, using flows or packet streams observed at individual nodes. This embodiment may be used in the system of
Some terminology used herein will now be introduced. The expression S is used herein to denote a stream expression of interest being studied, and N is the cardinality of the union of all streams. The variable δ represents a specified desirable value of relative standard error, which is a measure of an estimate's reliability obtained by dividing the standard error of the estimate by the estimate itself.
In the context of a stream of network flows, a stream expression passes through an arbitrary combination of some network nodes but not others. In general, distributed streams under consideration will be denoted by Tj, 1≦j≦J, where J is the number of nodes in the network. The operators ∪, ∩, and \ represent set union, set intersection, and set difference, respectively. The operator |•| is used to denote the cardinality of a set. For a stream expression that involves an arbitrary combination of unions, intersections, or differences of Tj, 1≦j≦J, e.g.,
estimating the cardinality of the set is done to provide probabilistic guarantees of estimation accuracy, which permits minimizing computational overhead and memory usage.
The following additional notations will be used herein. The expression P(•) represents a probability function. The expressions E(•) and var(•) represent the expectation and variance of a random variable, respectively. The expressions corr(•,•) and cov(•,•) represent correlation and covariance, respectively, between two random variables, and the expression $\xrightarrow{d}$ represents convergence in distribution. The expression ≜ means a definition, and the expression a≈b is equivalent to a/b≈1, where the operator ≈ represents an approximation of equality. The expression Exp(r) (also written as er) is an exponential distribution with rate parameter r, and Normal(μ, σ²) represents a Gaussian distribution with mean μ and variance σ². Finally, for a stream expression S over J streams, the variable N represents the cardinality of the total stream union, and pS=|S|/N represents the proportion of S in N, which proportion is also referred to herein as simply p.
A brief introduction of some statistical concepts that are used herein, including the notion of statistical efficiency, will now be provided. The expression fθ(x) is a probability-density function for a continuous random variable x (or a probability-mass function for a discrete random variable x), parameterized by θ. Supposing a random sample of size n is drawn from fθ(•), i.e., x1, x2, . . . , xn, the probability density associated with the observed data is the product of fθ(xi) over i=1, . . . , n. As a function of θ with x1, x2, . . . , xn observed, this is, in fact, the likelihood function L(θ) of θ, i.e.,
$$L(\theta)=\prod_{i=1}^{n} f_\theta(x_i).$$
The Fisher information I(θ) is defined by
$$I(\theta)=E\left[\left(\frac{\partial}{\partial\theta}\log L(\theta)\right)^{2}\right]=-E\left[\frac{\partial^{2}}{\partial\theta^{2}}\log L(\theta)\right], \qquad (1)$$
and θ̂ is an unbiased estimator of θ based on the given sample x1, x2, . . . , xn. The variance of θ̂ is then bounded below by the reciprocal of the Fisher information I(θ), i.e.,
$$\operatorname{var}(\hat\theta)\ge\frac{1}{I(\theta)}.$$
The foregoing inequality is the well-known Cramer-Rao inequality, also known as the Cramer-Rao lower bound (CRB), as described in Bickel et al., Mathematical Statistics: Basic Ideas and Selected Topics, Vol 1, (2nd Edition), Prentice Hall, 2000, which is incorporated herein by reference in its entirety. The efficiency of θ̂ is defined using the Cramer-Rao lower bound by
$$\operatorname{eff}(\hat\theta)=\frac{I(\theta)^{-1}}{\operatorname{var}(\hat\theta)}.$$
From the Cramer-Rao inequality, eff(θ̂)≦1. A desirable statistical inference method seeks an efficient estimator θ̂, i.e., one whose efficiency eff(θ̂) is close to 1. One such method is the popular Maximum-Likelihood Estimation (MLE) method. The MLE of θ, denoted θ̂MLE, is defined by finding the value of θ that maximizes L(θ), i.e., θ̂MLE=argmaxθ L(θ), where the function argmax represents the argument of the maximum. As the number of samples increases to infinity, the MLE is asymptotically unbiased and efficient, i.e., it achieves the Cramer-Rao lower bound.
The cardinality of distributed streams can be estimated using vectors that are compact representations of the actual elements in the streams. The vectors have M components, where M>1. Such vectors may also be referred to as “sketches,” “statistical sketches,” “statistical digests,” “digests,” or “minimal statistics.” Such vectors can be, e.g., sampling-based or hash-based probabilistic vectors. These probabilistic vector-based solutions, which largely focus on novel algorithms for deriving the vectors, include the Flajolet-Martin (FM) estimator, proposed for calculating the cardinality of a single stream, as disclosed in Flajolet et al., “Probabilistic counting,” In Proc. Symp. on Foundations of Computer Science (FOCS), 1983, which is incorporated herein by reference in its entirety.
A continuous variant of the Flajolet-Martin (FM) vector, which is used to develop an efficient PUE estimator for stream-expression cardinalities, will now be described. Flajolet and Martin proposed an estimator for counting distinct values in a single stream, using a hash-based probabilistic vector with O(log N) space, where N is the true cardinality. In the original version of the Flajolet-Martin (FM) algorithm for generating vectors, a hash function is used to map an element in the stream to an integer that follows a geometric distribution. In embodiments of the present invention, a continuous variant of the FM algorithm is developed by replacing the geometric random number with a uniform random number. The continuous variant is used to simplify the statistical analysis, as will be discussed below.
To generate independent replicates of statistics used for counting cardinalities, a technique referred to as stochastic averaging is employed, as described in Durand et al., “Loglog counting of large cardinalities,” Proceedings of European Symposium on Algorithms, pages 605-617, 2003, which is incorporated herein by reference in its entirety. In stochastic averaging, elements are randomly distributed over an array of buckets. For a single stream T, the expression [M] represents the data domain of its element e. In the example of IP-flow counting, e is the packet-flow ID, and [M] is the set of all possible values of flow IDs. As shown in the flowchart of
$$Y[k]\leftarrow\min(Y[k],\,g(e)). \qquad (3)$$
At step 550, a determination is made whether additional elements e exist, in which case the method returns to step 520. If, at step 550, it is determined that no additional elements exist, then the method proceeds to step 560, wherein hash array Y is returned as a result. A bucket value will remain at 1 if no element is hashed to that bucket. The following exemplary pseudo-code (Algorithm 1) may be used to implement continuous FM-vector generation for stream T:
In Algorithm 1, steps 1, 2, 3, 4, 5, and 6 correspond to steps 510, 550, 520, 530, 540, and 560 of
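As an illustration only, the continuous FM-vector generation described above might be sketched in Python as follows. The BLAKE2-based helpers `h` and `g`, their salts, and the function name `continuous_fm_vector` are assumptions made for this sketch and are not taken from the specification; any pair of hash functions that behave like independent uniform hashes could be substituted.

```python
import hashlib


def _hash64(element, salt):
    # Hypothetical helper: 64-bit hash of an element under a given salt.
    data = str(element).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def h(element, m):
    # Bucket-selection hash: maps an element to a bucket index in [0, m).
    return _hash64(element, b"bucket") % m


def g(element):
    # Value hash: maps an element to a pseudo-uniform number in (0, 1].
    return (_hash64(element, b"value") + 0.5) / 2.0**64


def continuous_fm_vector(stream, m):
    y = [1.0] * m                 # every bucket starts at 1
    for e in stream:              # visit each element of the stream
        k = h(e, m)               # select the bucket by hashing the element
        y[k] = min(y[k], g(e))    # minimum update of Equation (3)
    return y                      # return the hash array Y
```

Because each update is a minimum, repeated elements leave the vector unchanged, so the vector depends only on the set of distinct elements in the stream.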
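By ignoring the weak dependence among {Y[k], 1≦k≦m}, and writing μ=|T|/m, the likelihood function of μ can be written as
$$L(\mu)=\prod_{k=1}^{m}\mu^{I(Y[k]<1)}\exp(-\mu\,Y[k]),$$
where I(•) is an indicator function. Thus, the MLE of μ, and hence |T|, are given by:
$$\hat\mu=\frac{\sum_{k=1}^{m} I(Y[k]<1)}{\sum_{k=1}^{m} Y[k]},\qquad \widehat{|T|}=m\,\hat\mu. \qquad (4)$$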
It is assumed above that two universal hash functions h and g are available for producing random independent numbers. To be more realistic, t-wise independent hashing, which employs additional storage for storing a seed, could alternatively be used.
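A minimal sketch of the single-stream cardinality estimate follows. It assumes that each bucket value behaves like an exponential variable with rate μ=|T|/m that is right-censored at 1, so that the maximum-likelihood estimate is the number of buckets below 1 divided by the sum of the bucket values; that closed form, the BLAKE2-based hashes, and all function names are assumptions of this sketch rather than text from the specification.

```python
import hashlib


def _hash64(element, salt):
    # Hypothetical helper: 64-bit hash of an element under a given salt.
    data = str(element).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def continuous_fm_vector(stream, m):
    # Continuous FM vector: bucket-wise minimum of pseudo-uniform values.
    y = [1.0] * m
    for e in stream:
        k = _hash64(e, b"bucket") % m
        u = (_hash64(e, b"value") + 0.5) / 2.0**64
        y[k] = min(y[k], u)
    return y


def estimate_cardinality(y):
    # Censored-exponential MLE: mu_hat = (#buckets below 1) / sum(Y),
    # and the cardinality estimate is m * mu_hat.
    m = len(y)
    uncensored = sum(1 for v in y if v < 1.0)
    return m * uncensored / sum(y)
```

With m buckets, the relative error of this estimate is expected to be on the order of 1/√m, so a few thousand buckets already give errors of a few percent.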
The development of a new Proportional-Union Estimation (PUE) method for estimating the cardinality of a stream expression using continuous FM vectors will now be described. For clarity, a PUE method will first be demonstrated for a set expression over two streams, and then this method will be generalized to an arbitrary number of streams Tj, j=1, . . . , J.
For each stream Tj, j=1,2, the expression Yj[k], 1≦k≦m represents the corresponding continuous FM vector, obtained using Algorithm 1 with the same hash functions h and g. (For ease of reference, the “[k]” portion of the expression will be omitted when referring to these vectors at a bucket location k.) As shown in
The following Statement 1 enables a proportional-union estimate for all three cardinalities, |T1∩T2|, |T1\T2|, and |T2\T1|:
STATEMENT 1:
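Suppose that m is large enough such that (1−m⁻¹)ᵐ≈e⁻¹. Let N=|T1∪T2|, and ε=exp(−N/m). Then
$$P(Y_1=Y_2)\approx\varepsilon+(1-\varepsilon)\,N^{-1}|T_1\cap T_2|,$$
$$P(Y_1<Y_2)\approx(1-\varepsilon)\,N^{-1}|T_1\setminus T_2|,$$
$$P(Y_1>Y_2)\approx(1-\varepsilon)\,N^{-1}|T_2\setminus T_1|. \qquad (5)$$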
For example, if N is large enough such that ε is negligible, then Statement 1 permits the conclusion that |T1∩T2|≈N·P(Y1=Y2). Therefore, an estimate of |T1∩T2| can be obtained using the product of estimates of |T1∪T2| and P(Y1=Y2). It is noted that the continuous FM vector (Algorithm 1) of the stream union T1∪T2 is exactly the bucket-wise minimum min(Y1, Y2), and therefore, |T1∪T2| can be estimated by using Equation (4) on the new vector. Furthermore, P(Y1=Y2) can be estimated empirically from the observed vector-pair (Y1,Y2).
If N̂ is the estimate of N obtained by using Equation (4) on the continuous FM vector of the stream union T1∪T2, defined by min(Y1, Y2), then, for a large N such that ε is negligible, the cardinalities |T1∩T2|, |T1\T2|, and |T2\T1| can be estimated by
$$\widehat{|T_1\cap T_2|}=\hat N\,\hat P(Y_1=Y_2),$$
$$\widehat{|T_1\setminus T_2|}=\hat N\,\hat P(Y_1<Y_2),$$
$$\widehat{|T_2\setminus T_1|}=\hat N\,\hat P(Y_1>Y_2), \qquad (6)$$
respectively, where P̂ represents the empirical probabilities based on the observed vector pair (Y1[k], Y2[k]), k=1, . . . ,m. In other words, the first line of Equation (6) states that an estimate for |T1∩T2| can be derived by multiplying the estimate N̂ of the stream-union cardinality N by the estimated probability that Y1=Y2. The second line of Equation (6) states that an estimate for |T1\T2| can be derived by multiplying N̂ by the estimated probability that Y1<Y2. The third line of Equation (6) states that an estimate for |T2\T1| can be derived by multiplying N̂ by the estimated probability that Y2<Y1. Equation (6) can be used to obtain an estimate of either (i) the number of received data packets that have been received by all of the nodes of the set, or (ii) the number of flows within received data packets that have passed through all of the nodes of the set. The foregoing estimation scheme or method is referred to as a Proportional-Union Estimator (PUE) because the estimation is based on proportions of subsets of a stream union.
If ε=exp(−N/m) is not negligible, then Equation (5) can be inverted to obtain a PUE estimate of the cardinalities.
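The two-stream estimate can be sketched in Python as shown below, under the assumption that ε is negligible (N/m is large). The hash helpers, the censored-exponential form used for the union estimate, the dictionary keys, and the function names are assumptions of this sketch, not the specification's own code.

```python
import hashlib


def _hash64(element, salt):
    # Hypothetical helper: 64-bit hash of an element under a given salt.
    data = str(element).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def continuous_fm_vector(stream, m):
    # Both streams must be sketched with the same hash functions.
    y = [1.0] * m
    for e in stream:
        k = _hash64(e, b"bucket") % m
        u = (_hash64(e, b"value") + 0.5) / 2.0**64
        y[k] = min(y[k], u)
    return y


def estimate_cardinality(y):
    # Assumed censored-exponential MLE for a single vector.
    return len(y) * sum(1 for v in y if v < 1.0) / sum(y)


def pue_two_streams(y1, y2):
    m = len(y1)
    # The vector of the union is the bucket-wise minimum.
    n_hat = estimate_cardinality([min(a, b) for a, b in zip(y1, y2)])
    # Empirical proportions of buckets with Y1 = Y2, Y1 < Y2, Y1 > Y2.
    p_eq = sum(1 for a, b in zip(y1, y2) if a == b) / m
    p_lt = sum(1 for a, b in zip(y1, y2) if a < b) / m
    p_gt = sum(1 for a, b in zip(y1, y2) if a > b) / m
    return {
        "union": n_hat,
        "intersection": n_hat * p_eq,
        "t1_minus_t2": n_hat * p_lt,
        "t2_minus_t1": n_hat * p_gt,
    }
```

Only the two m-component vectors need to be exchanged between nodes; the streams themselves never leave the node that observed them.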
A PUE method will now be generalized to estimate the cardinality of a set expression over multiple streams Tj, j=1, . . . , J, with J≧2. The expression Yj[k], k=1, . . . ,m represents the corresponding continuous FM vectors by Algorithm 1 for stream Tj. As before, for ease of reference, the “[k]” portion of the expression will be omitted when referring to these vectors at a bucket location k. The expression Y∪ represents the continuous FM vector for the stream union of all Tj, j=1, . . . , J, and is defined as:
$$Y_{\cup}=\min(Y_1,\ldots,Y_J). \qquad (7)$$
The following Statement 2 is an extension of Statement 1 on P(Y1=Y2) to the case of multiple streams:
STATEMENT 2:
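Suppose m is large enough such that (1−m⁻¹)ᵐ≈e⁻¹.
Let N be the cardinality of the union of all Tj, j=1, . . . , J, and ε=exp(−N/m).
Then for 1≦d≦J,
$$P(Y_1=Y_2=\cdots=Y_d=Y_{\cup})\approx\varepsilon+(1-\varepsilon)\,N^{-1}\Big|\bigcap_{j=1}^{d}T_j\Big|. \qquad (8)$$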
Supposing that S is a set expression over J streams, whose cardinality is the subject of interest, to complete the generalization of a PUE method for |S|, two additional techniques are used for dealing with set expressions, which are illustrated by the following example. Considering S=T1\((T2∩T3)∪T4), the first technique is to remove set differences that appear in the expression using the relation
$$|A\setminus B|=|A|-|A\cap B|.$$
It is noted that this relation can be used repeatedly if there are multiple set differences. In this example, this implies
$$|T_1\setminus((T_2\cap T_3)\cup T_4)|=|T_1|-|T_1\cap((T_2\cap T_3)\cup T_4)|.$$
Therefore, without loss of generality, it can be assumed that the set expression only involves unions and intersections. The second technique is to rewrite the set expression in terms of intersections of set unions, i.e.,
$$(A\cap B)\cup C=(A\cup C)\cap(B\cup C).$$
In this example, this implies
$$T_1\cap((T_2\cap T_3)\cup T_4)=T_1\cap(T_2\cup T_4)\cap(T_3\cup T_4).$$
It is noted that the continuous FM vector of a set union of d streams is exactly the minimum of the d individual FM vectors. If pS=|S|/N is the proportion of S in the total union, then, by applying Statement 2 with vectors Yj replaced by the vectors corresponding to the set unions, a close approximation of pS can be derived. In this example, the following expressions hold true (for simplicity, ε is assumed to be ignorable):
$$N^{-1}|T_1|\approx P(Y_1=Y_{\cup}),$$
$$N^{-1}|T_1\cap((T_2\cap T_3)\cup T_4)|\approx P(Y_1=\min(Y_2,Y_4)=\min(Y_3,Y_4)=Y_{\cup}),$$
and thus
$$p_S=N^{-1}|T_1\setminus((T_2\cap T_3)\cup T_4)|\approx P(Y_1=Y_{\cup})-P(Y_1=\min(Y_2,Y_4)=\min(Y_3,Y_4)=Y_{\cup}).$$
Assuming N̂ is the estimate of N by Y∪, and p̂S is the empirical proportion based on the observed vector corresponding to the tuple of the J streams, if N is large enough such that ε=exp(−N/m) is negligible, then a PUE estimate of |S| can be obtained in a straightforward way as in the two-stream case, by:
$$\widehat{|S|}=\hat N\cdot\hat p_S.$$
It is noted that ε can be estimated by exp(−N̂/m). If ε is not negligible, then the proportion equation derived from the above procedure can be inverted to obtain the PUE estimate of |S| correspondingly.
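The four-stream example S=T1\((T2∩T3)∪T4) can be sketched as follows, again assuming ε is negligible. The function `pue_example`, the hash helpers, and the censored-exponential union estimate are assumptions of this sketch; the two bucket tests mirror the probabilities P(Y1=Y∪) and P(Y1=min(Y2,Y4)=min(Y3,Y4)=Y∪).

```python
import hashlib


def _hash64(element, salt):
    # Hypothetical helper: 64-bit hash of an element under a given salt.
    data = str(element).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def continuous_fm_vector(stream, m):
    y = [1.0] * m
    for e in stream:
        k = _hash64(e, b"bucket") % m
        u = (_hash64(e, b"value") + 0.5) / 2.0**64
        y[k] = min(y[k], u)
    return y


def estimate_cardinality(y):
    # Assumed censored-exponential MLE for a single vector.
    return len(y) * sum(1 for v in y if v < 1.0) / sum(y)


def pue_example(y1, y2, y3, y4):
    # PUE estimate of |T1 \ ((T2 n T3) u T4)| from four FM vectors.
    m = len(y1)
    y_union = [min(vals) for vals in zip(y1, y2, y3, y4)]
    n_hat = estimate_cardinality(y_union)
    in_t1 = 0        # buckets with Y1 = Y_union
    in_removed = 0   # buckets with Y1 = min(Y2,Y4) = min(Y3,Y4) = Y_union
    for a, b, c, d, u in zip(y1, y2, y3, y4, y_union):
        if a == u:
            in_t1 += 1
            if min(b, d) == u and min(c, d) == u:
                in_removed += 1
    p_hat = (in_t1 - in_removed) / m
    return n_hat * p_hat
```

The same pattern extends to other expressions: remove set differences, rewrite the remainder as an intersection of unions, and test each union by taking the bucket-wise minimum of its vectors.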
The relative standard error (RSE) for PUE grows linearly with 1/√pS, and the RSE of the proportional-union estimator for a set expression S is approximately:
$$\mathrm{RSE}\approx\frac{1}{\sqrt{m\,p_S}}.$$
The memory usage of a PUE scheme will now be discussed. For the stream expression S, the variable δ represents a specified value of the relative standard error of a proportional-union estimate of |S|. Given the above equation for calculating RSE, the number of buckets used for a given S is
$$m\approx\delta^{-2}\,p_S^{-1}.$$
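As a worked illustration of the bucket-count calculation, the sketch below assumes the approximation RSE ≈ 1/√(m·pS), which follows if the union estimate and the empirical proportion are treated as independent with variances of about 1/m and (1−pS)/(m·pS) in relative terms. The function names are hypothetical, and the formula is an assumption of this sketch.

```python
import math


def expected_rse(m, p_s):
    # Approximate relative standard error for m buckets and proportion p_s.
    return 1.0 / math.sqrt(m * p_s)


def buckets_needed(delta, p_s):
    # Smallest bucket count whose approximate RSE does not exceed delta.
    if not (delta > 0 and 0 < p_s <= 1):
        raise ValueError("delta must be positive and p_s must lie in (0, 1]")
    return math.ceil(1.0 / (delta ** 2 * p_s))
```

For example, a target RSE of 5% for an expression covering one fifth of the union calls for roughly 2,000 buckets, and the requirement grows as the proportion shrinks.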
If the expression E is a unit-exponential random variable, and λj=|Tj|/m, j=1, . . . ,J, then the continuous FM vector Yj for Tj is a right-censored exponential, defined by
Yj˜min(λj−1E,1) (9)
Since λj˜O(N), it is implied that Yj uses O(log N) storage bits. The following procedure can reduce the per-bucket storage of vector statistics Yj from log N to log log N, by instead storing log Yj. It is noted that, by Equation (9),
log Yj˜min(0, −log(λj)+log E). (10)
Assuming log log λj is an integer, log λj uses log log λj storage bits, and log Yj uses at most log log(N)+a storage bits, where the a bits are used for storing the decimals of log E (for reference, it is noted that the 0.1% and 99.9% quantiles of log E are −6.907 and 1.933, respectively). Therefore, the per-bucket storage is now at most O(log log N), and the total memory used is approximately δ⁻²pS⁻¹(log log N+a) bits.
From experimental studies, it has been observed that a=10 bits is enough for storing the decimals of log E without compromising the overall accuracy of the cardinality estimate. This can be further justified using a careful bias analysis of the probability approximation of pS, e.g., using Equations (5) and (8). For example, considering S=T1∩T2 over the J streams, and p=pS=P(Y1=Y2=Y∪), where p(I) is the new probability based on the discretized vectors Yj, j=1, . . . ,J, described above, it can be shown that p≦p(I)≦p+2^(−a+1). Therefore, if a=10, then the difference in the probabilities is at most 0.002, which is negligible for practical purposes.
It is further noted that, alternatively, a direct method can be used for computing the logarithmic vector log Yj. By Algorithm 1, the following expression is true:
$$Y_j=\min(1,U_1,\ldots,U_B),$$
where B is a binomial random number representing the number of distinct elements that are mapped to the bucket, and each Ui is a uniform random number in [0,1]. It is noted that −log Ui follows a unit exponential distribution, and therefore,
log Yj=−max(0, −log U1, . . . , −log UB).
Thus, to generate the logarithmic continuous FM vectors, the uniform random number generator g(•) can be replaced by a unit-exponential random-number generator (with the decimal truncated into a bits), and the minimum update can be replaced by a maximum update. This avoids taking the logarithm in the vector generation, and the initial values for the buckets now become 0 instead of 1.
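The logarithmic variant can be sketched as below. The sketch stores the non-negative quantity −log Yj in each bucket, draws the unit-exponential number as −log of a hashed uniform value, and truncates it to `a` fractional bits; the hash helpers and the name `log_fm_vector` are assumptions of this sketch.

```python
import hashlib
import math


def _hash64(element, salt):
    # Hypothetical helper: 64-bit hash of an element under a given salt.
    data = str(element).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def h(element, m):
    # Bucket-selection hash.
    return _hash64(element, b"bucket") % m


def g(element):
    # Pseudo-uniform value in (0, 1].
    return (_hash64(element, b"value") + 0.5) / 2.0**64


def log_fm_vector(stream, m, a=10):
    # Each bucket holds -log(Y[k]), truncated to `a` fractional bits.
    scale = 1 << a
    z = [0.0] * m                               # buckets start at 0, not 1
    for e in stream:
        k = h(e, m)
        x = -math.log(g(e))                     # unit-exponential draw
        x = math.floor(x * scale) / scale       # keep a fractional bits
        z[k] = max(z[k], x)                     # maximum update
    return z
```

Since truncation is monotone, the maximum of the truncated values equals the truncated value of −log of the bucket minimum, so this vector carries the same information as a discretized continuous FM vector.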
The efficiency of a PUE method will now be discussed, using the formal statistical methods described above. The likelihood of the continuous FM vectors is derived for the case of two streams, the MLE of the cardinality parameters is obtained, and then its asymptotic variance is compared with that of a PUE method. As explained above, MLE is asymptotically most efficient, because it can achieve the Cramer-Rao lower bound. For a set expression S whose cardinality is the subject of interest, pS=|S|/N is the proportion of the cardinality of S in the total union. It will be shown that, in the two-stream case, a PUE method is as efficient as the MLE when pS is small.
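In addition to the notation used above in the discussion of set expressions over two streams, the following are defined as unknown cardinality parameters:
$$\lambda_0=|T_1\cap T_2|/m,\quad \lambda_1=|T_1\setminus T_2|/m,\quad \lambda_2=|T_2\setminus T_1|/m,$$
where
$$\theta=(\lambda_0,\lambda_1,\lambda_2)^{T}.$$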
The function fλ(•) denotes the density function of an exponential random variable with rate λ, i.e., fλ(x)=λe−λx, x≧0. The density function for the continuous FM vectors (Y1,Y2), i.e., P(Y1=y1,Y2=y2), 0≦y1,y2≦1, can be expressed as
The vectors at two different bucket locations (Y1[k], Y2[k]) and (Y1[j], Y2[j]) for j≠k, are very weakly dependent. If l(θ) is the negative logarithmic-likelihood function of the continuous FM vectors (Y1[k], Y2[k]), k=1, . . . ,m, i.e.,
then the following Result 1 provides the gradient and Hessian matrix of l(θ) with respect to θ, noting that the expectation of the Hessian matrix is the same as the information matrix I(θ) defined in Equation (1).
RESULT 1:
The gradient
is given by
The Hessian matrix
is a 3×3 symmetric non-negative definite matrix with elements Hij given by:
Unlike a PUE estimate, the MLE of θ that minimizes l(θ) does not have a closed-form solution. By Result 1, l(θ) is strictly convex with a probability of almost 1, and hence, its unique minimum can be located using, e.g., a Newton-Raphson algorithm. The expression θ̂(MLE) represents the MLE of θ obtained by minimizing l(θ), and the MLE of the cardinalities is simply m·θ̂(MLE).
It has been observed, through simulation and experimental studies, that the Newton-Raphson iteration typically employs only a few steps (fewer than 5) before convergence is reached. The following Statement 3 provides the asymptotic distribution for the relative accuracy of θ̂(MLE):
STATEMENT 3:
Let pS=|S|/N. For λ0>0 and large λ0+λi, i=1,2, as m goes to infinity, we have, for S=T1∩T2, T1\T2, and T2\T1,
which is close to 1 for a small value of pS. Therefore, a PUE estimator has almost the same efficiency as the MLE estimator.
For cardinalities defined over a larger number of streams, a PUE method is still relatively simple to implement, while the MLE method becomes difficult, due to the complexity of the likelihood computation.
The estimating may estimate how many of the data packets have propagated to every node of the set. The estimating may estimate how many of the flows have data packets that have passed through all of the nodes of the set. The updating may further include determining a number to assign to the component for updating based on the property of the data packet received by the one of the nodes. The estimating may involve evaluating a correlation between the vectors.
The constructing of the vector of M components may involve updating the number assigned to each component of one of the vectors by a process that changes the assigned number in a monotonic manner.
The estimation performed by the server may estimate how many of the data packets have propagated to every node of the set. The estimation performed by the server may estimate how many of the flows have data packets that have passed through all of the nodes of the set. The updating may further include determining a number to assign to the component for updating based on the property of the data packet received by the one of the nodes. The estimating may involve evaluating a correlation between the vectors.
The constructing of the vector of M components may involve updating the number assigned to each component of one of the vectors by a process that changes the assigned number in a monotonic manner.
The terms “Proportional-Union Estimation” and “PUE” should be understood to include the methods described herein, as well as other implementations and embodiments of proportional-union estimation not specifically set forth herein, and that such terms should not be construed as being limited by the methods described herein.
While embodiments of the invention disclosed herein use an FM method for generating vectors, it should be understood that a PUE estimator consistent with alternative embodiments of the invention could use other methods for generating vectors.
The present invention has applicability in monitoring traffic in different environments and comprising data streams of different types, including not only traditional-network (e.g., hardwired LAN) data streams, but also, e.g., wireless-network data streams, sensor-network data streams, and financial-application data streams.
The term “random,” as used herein, should not be construed as being limited to pure random selections or number generations, but should be understood to include pseudo-random, including seed-based selections or number generations, as well as other selection or number generation methods that might simulate randomness but are not actually random, or do not even attempt to simulate randomness.
The present invention may be implemented as circuit-based processes, including possible implementation as a single integrated circuit (such as an ASIC or an FPGA), a multi-chip module, a single card, or a multi-card circuit pack. As would be apparent to one skilled in the art, various functions of circuit elements may also be implemented as processing blocks in a software program. Such software may be employed in, for example, a digital signal processor, micro-controller, or general-purpose computer.
The present invention can be embodied in the form of methods and apparatuses for practicing those methods. The present invention can also be embodied in the form of data-storage media, such as magnetic recording media, optical recording media, solid state memory, floppy diskettes, CD-ROMs, hard drives, or any other machine-readable data-storage medium storing machine-readable program code, wherein the program code includes a set of instructions for executing one of the inventive methods on a digital data-processing machine, such as a computer, to perform the method. The present invention can also be embodied in the form of program code, for example, whether stored in a storage medium, loaded into and/or executed by a machine, or transmitted over some transmission medium or carrier, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.
Unless explicitly stated otherwise, each numerical value and range should be interpreted as being approximate as if the word “about” or “approximately” preceded the value of the value or range.
It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain the nature of this invention may be made by those skilled in the art without departing from the scope of the invention as expressed in the following claims.
It should be understood that the steps of the exemplary methods set forth herein are not necessarily required to be performed in the order described, and the order of the steps of such methods should be understood to be merely exemplary. Likewise, additional steps may be included in such methods, and certain steps may be omitted or combined, in methods consistent with various embodiments of the present invention.
Although the elements in the following method claims, if any, are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments. The same applies to the term “implementation.”
This application is related to U.S. application Ser. No. ______, filed on the same date as this application as attorney docket no. Cao 7-7, the teachings of which are incorporated herein by reference.