This disclosure relates generally to sampling, and, more particularly, to methods, apparatus, and articles of manufacture to sample data connections.
Random sampling is a tool that can be used to facilitate working with large datasets. For example, through random sampling, querying of a full dataset can be replaced by querying of a smaller (and hence easier to store and manipulate) sample of the full dataset. A random sample constitutes a summary obtained by randomly sampling the data in the full dataset such that the summary can be used to represent the full dataset.
Example methods, apparatus, and articles of manufacture disclosed herein present a stream sampling method that can handle signed weighted updates. Previous techniques of stream sampling are limited in that they consider unweighted samples or weighted samples having only positive weight. In the illustrated examples presented herein, signed weighted updates of a sample, whether positive or negative, are analyzed and appropriate sampling thresholds for bounding the number of samples stored in a cache are established to generate an estimate for a characteristic of a full dataset.
Prior sampling techniques of large datasets focused on a random access model where the data is static and disk resident. However, in modern applications received data is generally not static, but rather is constantly changing. Thus, the data in such applications can be defined as a stream of transactions, where each transaction modifies the current state of the data.
For example, such a data stream can correspond to a sequence of financial transactions, each of which updates an account balance. In such examples, it may be desirable to be able to maintain a sample over current balances, which describes the overall state of the system, and to use the sample to provide a snapshot of the system against which to quickly test for anomalies without having to traverse the entire account database.
As another example, such a data stream can correspond to records that are inserted in and/or deleted from tables of a database. An example database management system may keep statistics on each attribute within a table to determine what indices to keep and how to optimize query processing. Currently, deployed systems track only simple aggregates online (e.g., the number of records in a table), and more complex statistics involve a complete scan of the database, which may be unsuitable for (near) real-time systems.
As yet another example, such a data stream can represent network activities at a number of nodes of the network. For example, such network activities could include setting up and/or tearing down data connections, from which an identification can be made of whether the data connections are active or have been terminated. In some examples, other characteristics of the data connections can be additionally or alternatively tracked. Such example characteristics may include, but are not limited to, the number of file transfer protocol (FTP) connections currently in a network, the number of connections to a particular region, the number of connections lasting longer than a particular time (e.g., an hour), etc. It may not be practical for a service provider of a network or data system to centrally keep a complete list of all current open and active connections and/or other network characteristics. Instead, example methods, apparatus, and articles of manufacture (e.g., storage media) for random sampling as disclosed herein can draw a random sample of the connections from which the service provider can determine statistics, for example, on quality of service, round-trip delay, nature of traffic in the network, etc. Such example statistics may be used, for example, to show that agreements between the service provider and its customers are being met, and/or may be used for traffic shaping and planning purposes, etc.
Example methods, apparatus, and articles of manufacture disclosed herein perform random sampling of a data stream in which each data entry of the stream includes a positive or negative update weight Δ associated with one or more keys, i. The keys, i, of the data stream represent respective information elements for which one or more characteristics can be estimated from information provided via the data stream. For example, the keys, i, of the data stream can represent respective different network nodes of a network, different customers whose transactions are stored in a database, different tables or addresses of a database, etc. As such, the keys i represent elements (such as network nodes, etc.) whose characteristic(s) can be monitored using information obtained via the sampled data stream.
In the illustrated examples, samples (e.g., messages) are taken from the data stream that include (or can be mapped to) a key identifier i and corresponding update value Δ. The example key identifier i of the sample is used to determine that the sample is associated with a key i, stored in the cache. The update value Δ represents a change to a data characteristic of the key i (e.g., a data connection at a corresponding node opened).
In the illustrated examples provided herein, the data to be sampled is a stream of updates of the form (i, Δ), where i identifies a particular key to be sampled and Δ∈R, where R is the set of real numbers. In the illustrated examples, the value vi of a key i is initially 0 and is modified by subsequent updates, but the aggregate values of the keys do not become negative. Formally, the value of key i is initially vi=0 and, after an update (i, Δ), becomes:
$$v_i \leftarrow \max\{0,\; v_i + \Delta\}.$$
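The update rule above can be illustrated with a short sketch that tracks the exact (unsampled) values. This is only a reference baseline under stated assumptions: the function name `apply_updates` and the key labels are hypothetical and do not appear in the disclosure.

```python
from collections import defaultdict


def apply_updates(updates):
    """Return the exact value v_i of every key after a stream of (i, delta) updates."""
    values = defaultdict(float)  # every key starts at v_i = 0
    for key, delta in updates:
        # v_i <- max{0, v_i + delta}: a value never drops below zero
        values[key] = max(0.0, values[key] + delta)
    return dict(values)


# Hypothetical stream: two connections open and one closes at node-a;
# node-b receives a negative update first, which is clamped at zero.
stream = [("node-a", +2.0), ("node-a", -1.0), ("node-b", -3.0), ("node-b", +0.5)]
exact = apply_updates(stream)
```

Such an exact table grows with the number of keys, which is the cost the sampling approach described herein is intended to avoid.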
Some example methods, apparatus, and articles of manufacture disclosed herein maintain an example cache S of keys being monitored and an example count ci for each cached key i∈S. In some examples, a sampling threshold τ, corresponding to the inverse of a sampling rate q (i.e., τ=1/q), is used to determine the number of keys to be kept in the cache S. When an instance of a key i is sampled, if i∈S (i.e., the sampled instance is associated with a key i that is in the cache S), the example count ci is adjusted based on the update weight Δ of the key i. If the key i is not in the cache S, then the key i may be added to the cache S based on the update weight Δ and the sampling threshold τ.
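A minimal sketch of this fixed-threshold behavior follows. It assumes that admission of an uncached key is decided by comparing the update weight against an exponential draw with mean τ, as described later for the random number generator 260; the class name `SignedSampleAndHold` and its attributes are illustrative only.

```python
import random


class SignedSampleAndHold:
    """Fixed-threshold sampler for a stream of signed weighted updates (i, delta)."""

    def __init__(self, tau, rng=None):
        self.tau = tau                      # sampling threshold, tau = 1/q (must be > 0)
        self.counts = {}                    # cache S: key -> count c_i, always > 0
        self.rng = rng or random.Random()

    def update(self, key, delta):
        if key in self.counts:
            # Cached key: adjust the count by the update weight.
            c = self.counts[key] + delta
            if c > 0:
                self.counts[key] = c
            else:
                del self.counts[key]        # non-positive count: drop the key
        elif delta > 0:
            # Uncached key: admit only if the weight exceeds an exponential draw.
            r = self.rng.expovariate(1.0 / self.tau)  # mean tau
            if delta > r:
                self.counts[key] = delta - r


sampler = SignedSampleAndHold(tau=5.0, rng=random.Random(7))
for key, delta in [("node-a", 2.0), ("node-b", 9.0), ("node-a", -1.0)]:
    sampler.update(key, delta)
```

Negative updates to uncached keys are ignored in this sketch because such keys hold no count to decrement.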
In some examples, a bounded-size cache is implemented by the cache S, which stores counts ci for at most k keys. Such example methods, apparatus, and articles of manufacture disclosed herein strive to keep the cache S at full capacity of k keys. In some such examples, an effective sampling threshold for entering the key i into the cache S varies. For example, the sampling threshold may increase after processing a sample associated with a key having a positive update but decrease after processing a sample having a negative update. In some such examples, when one or more negative update(s) cause removal of a key from the cache S so there are fewer than k cached keys, the effective sampling threshold becomes zero. In some examples, the count ci of each corresponding cached key i is adjusted based on an adjustment (e.g., an increase) to the sampling threshold τ (e.g., τ+ci). Such example methods, apparatus, and articles of manufacture described herein provide unbiased estimates with bounded variance as a function of the effective sampling rate.
In the illustrated example of
In the illustrated example of
The example monitor 110 of
In some examples, the update information (i, Δ) includes or is accompanied by data connection characteristics (e.g., the type of data connection that was established or that was ended, the start and/or end time of a data connection, location information of the source and/or destination of the data connection, etc.).
The example monitor 110 of
For example, in
In the illustrated example of
The example sample analyzer 220 of
In the illustrated example of
Furthermore, assume that the sampled update is associated with the first node 130 and, thus, the key being analyzed is the key i representing the first node 130. In the illustrated example, the count ci may not be equal to a true data connection value vi of the first node 130, which is two in the illustrated example of
In the illustrated example of
In the illustrated example of
To perform such a determination, in some examples, the cache controller 250 requests a random number from the random number generator 260. The random number generator 260 generates a random number r based on the sampling threshold τ. For example, the random number generator 260 may use the sampling threshold τ as the mean of a probability distribution used to generate the random numbers provided to the cache controller 250. The random number generator 260 provides the cache controller 250 with the random number r.
The example cache controller 250 of
In some examples, if the cache controller 250 determines that the key i is to be added to the cache 240, the cache controller 250 removes an existing key stored in the cache 240 to make room for the new key i. In some such examples, the cache controller 250 maintains a sampling threshold τi for the keys stored in the cache 240. In the illustrated example, the cache controller 250 adjusts the sampling thresholds for the keys in the cache 240 using random numbers from the random number generator 260 when it determines that a new key is to be added. The example cache controller 250 then removes the corresponding key with the lowest adjusted sampling threshold τ, as described below.
The example estimator 270 of
In some examples, the estimator 270 makes an estimate of a characteristic of the data communication system 100 (e.g., the current number of data connections using FTP, the number of data connections opened for at least an hour, the number of data connections currently opened to Europe, etc.). In some such examples the characteristic to be estimated is requested by an example user. In generating an estimate, the example estimator 270 identifies the keys stored in the cache 240 that have the requested characteristic.
The estimator 270 of the illustrated example calculates a sum s representative of a characteristic based on the keys i in the cache 240 that meet the corresponding characteristic, using the sampling threshold τ and the corresponding counts ci, such that s←s+τ+ci for each such key (where s is reset to 0 before calculating a new estimate corresponding to the requested characteristic). The sum s is calculated over all keys in the cache 240 corresponding to the requested status (e.g., the number of open data connections) and/or requested characteristic (e.g., an estimate of the number of open data connections that have been opened for a particular length of time) of the data system 100. In the illustrated example, the estimator 270 of
While an example manner of implementing the monitor 110 of
Flowcharts representative of example machine readable instructions for implementing the monitor 110 of
As mentioned above, the example processes of
Example machine readable instructions 300 that may be executed to implement the monitor 110 of
At block 330, the sample analyzer 220 instructs the counter 230 to adjust the count ci associated with the corresponding key i based on the update weight Δ of the update sample. For example, the update weight Δ can correspond to a +1 if the sampled update corresponds to a sampled message indicating that a connection has been opened at the first node 130 of
In the illustrated example, if the adjusted count ci is greater than zero, at block 340, the counter 230 maintains the corresponding adjusted counts ci for use by the estimator 270 (at block 360 described below) in generating an estimate of the status and/or characteristic of the data communication system 100.
Returning to block 320 of
In the illustrated example of
In the illustrated example, the monitor 110 uses the keys i in the cache and their associated counts ci to make the estimate at block 360 of
At block 420, the cache controller 250 instructs the random number generator 260 to generate a random number r based on the sampling threshold τ. In the illustrated example, random numbers r generated by the random number generator 260 at block 420 are exponentially distributed with a mean substantially equal to the sampling threshold τ. The random number generator 260 provides the cache controller 250 with the random number r at block 420, and control moves to block 430.
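The exponential draw at block 420 can be sketched as follows. This is an illustrative sketch only: it assumes the admission test is "admit when Δ exceeds r", and the helper names `admission_probability` and `try_admit` are hypothetical. Python's `random.expovariate` takes a rate, so a mean of τ corresponds to a rate of 1/τ.

```python
import math
import random


def admission_probability(delta, tau):
    """Probability that an exponential draw with mean tau falls below delta."""
    return 1.0 - math.exp(-delta / tau) if delta > 0 else 0.0


def try_admit(delta, tau, rng):
    """Return the initial count for a new key, or None if it is not admitted."""
    r = rng.expovariate(1.0 / tau)  # exponentially distributed with mean tau
    return delta - r if delta > r else None


# Compare the empirical admission rate with the closed form.
rng = random.Random(42)
tau, delta, trials = 10.0, 3.0, 100_000
admitted = sum(try_admit(delta, tau, rng) is not None for _ in range(trials))
empirical = admitted / trials
expected = admission_probability(delta, tau)
```

Under these assumptions, small updates relative to τ are admitted rarely, while updates much larger than τ are admitted almost always.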
At block 430 of
In the illustrated example, at block 430 of
The example machine readable instructions 300, 400 of
In the examples described herein, a key i is associated with a count ci. For keys i that are not stored in the cache 240 (i∉S) the count ci is equal to zero (ci=0). In some examples, the distribution of the count ci of the key i (e.g., corresponding to the first node 130) depends on a true value vi of the first node 130 (e.g., the actual number of open data connections at the first node 130).
It can be shown by the following that in the examples presented herein, if the random number generator 260 employs an exponential distribution with mean τ to generate a random number r, the distribution of ci for a key i with a value vi is:
$$[v_i - \mathrm{Exp}_\tau]^+ \tag{1}$$
where Expτ is a random variable exponentially distributed with mean τ (e.g., the random number r), where τ is the sampling threshold. In the above equation, the notation [x]+ indicates a function max{x,0}.
The distribution of the count ci is determined based on changes in the count ci in response to updates of the value vi. For example, assuming a fixed key i and the set of updates being {Δ(n):n=0, 1, 2, . . . }, the corresponding values v(n) are v(0)=0 and v(n+1)=[v(n)+Δ(n)]+. In such examples, the value vi at a given time is the cumulative result of updates for a corresponding key (e.g., the first node 130) that have occurred up to that time. For example, if all four connections are initially closed for the first node 130, then if two connections open (Δ1=+2), followed by one of those connections closing (Δ2=−1), the value vi of the first node 130 at that given time is plus one (v3=0+2−1=+1).
In the illustrated example, based on the instructions 300 of
$$c^{(n+1)} = I(c^{(n)}>0)\,[c^{(n)}+\Delta^{(n)}]^+ + I(c^{(n)}=0)\,[\Delta^{(n)}-\mathrm{Exp}^{(n)}]^+ \tag{2}$$
where the elements of the set {Exp(n): n=0, 1, 2, . . . } are independent and identically distributed exponential random variables with mean τ, and I is an indicator function (i.e., I(X)=1 if the condition X is true, and I(X)=0 if the condition X is false). When the count value is equal to zero, c=0, in the illustrated example, the corresponding count c is not maintained for the key in question (i.e., the key i is not in the cache 240). In particular, when the count ci(n)=0, the counter 230 does not maintain a count c for this key prior to an update Δ(n). In the illustrated example, taking the positive part [c(n)+Δ(n)]+ ensures that when the update Δ(n) yields a non-positive count value, that count ci and the corresponding key i are removed from the cache 240.
In some examples, for each update n=0, 1, . . . , of a particular count ci=c
$$c^{(n)} =_d [v^{(n)} - \mathrm{Exp}]^+ \tag{3}$$
where =d denotes equality in distribution, and Exp is an exponential random variable of mean τ independent of {Exp(n′):n′≧n}. Assuming the condition in equation (3) is valid, then the count c(n+1) by equation (2) yields c̃(n+1) as follows:
$$c^{(n+1)} = \tilde{c}^{(n+1)} := I(v^{(n)}>\mathrm{Exp})\,[v^{(n)}-\mathrm{Exp}+\Delta^{(n)}]^+ + I(v^{(n)}\le \mathrm{Exp})\,[\Delta^{(n)}-\mathrm{Exp}^{(n)}]^+. \tag{4}$$
In such examples,
$$\tilde{c}^{(n+1)} =_d {c'}^{(n+1)} =_d \big[[v^{(n)}+\Delta^{(n)}]^+ - \mathrm{Exp}'\big]^+ \tag{5}$$
where Exp′ is an independent copy of Exp. When v(n)+Δ(n)≦0 then c′(n+1)=c̃(n+1)=0. When v(n)+Δ(n)>0, the complementary cumulative distribution function (CCDF) of c′(n+1) is (where Pr[X] denotes the probability of X):
The CCDF of c̃(n+1) can be derived from equation (4) as
$$\Pr[\tilde{c}^{(n+1)}>z] = \Pr[\mathrm{Exp}<\min\{v^{(n)},[v^{(n)}+\Delta^{(n)}-z]^+\}] + \Pr[v^{(n)}\le\mathrm{Exp}]\,\Pr[\mathrm{Exp}^{(n)}<[\Delta^{(n)}-z]^+] \tag{7}$$
When Δ(n)<z, the first term in equation (7) is
$$\Pr[\mathrm{Exp}<[v^{(n)}+\Delta^{(n)}-z]^+] = \Pr[{c'}^{(n+1)}>z] \tag{8}$$
and the second term is zero. When Δ(n)≧z, then
From Equation (1), the distribution of the count ci depends on the true value vi, and can be represented by a truncated exponential distribution when the random number generator 260 employs an exponential distribution to generate its random number. Accordingly, the above example processes executed by the sampling instructions 300 of
In some examples, each key is assigned an adjusted weight v̂i, which is equal to the sampling threshold plus the corresponding count (τ+ci) if the key i is stored in the cache 240 (i∈S), and zero (v̂i=0) if it is not stored in the cache 240. Using the convention ci=0 for deleted counts, this can be expressed succinctly as
$$\hat{v}_i = I(c_i>0)\,(c_i+\tau). \tag{10}$$
In the illustrated example above, the estimator 270 estimates a subset sum Σi:P(i) vi using the subset sum Σi:P(i) v̂i, the latter being the sum of adjusted weights of keys in the cache 240 which satisfy P, wherein P is the status and/or characteristic which the keys should have to be included in the estimate (e.g., being one of the nodes within the data system 100, a member of a group to be monitored or analyzed, the number of data connections that have been opened for a given period of time, etc.). The adjusted weight v̂i is an unbiased estimate of the true value vi, as described below.
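The subset-sum estimate can be sketched as a single pass over the cache. The function name `estimate_subset_sum`, the key naming scheme, and the example counts are hypothetical; the predicate plays the role of P.

```python
def estimate_subset_sum(counts, tau, predicate=lambda key: True):
    """Estimate the sum of v_i over keys satisfying the predicate, from cached counts."""
    s = 0.0  # s is reset to zero before each new estimate
    for key, c in counts.items():
        if c > 0 and predicate(key):
            s += tau + c  # adjusted weight of a cached key: tau + c_i
    return s


# Hypothetical cache contents: key -> count c_i.
cache = {"ftp:node-1": 2.5, "http:node-1": 0.75, "ftp:node-2": 4.0}
ftp_estimate = estimate_subset_sum(cache, tau=3.0, predicate=lambda k: k.startswith("ftp:"))
total_estimate = estimate_subset_sum(cache, tau=3.0)
```

Keys absent from the cache contribute zero, so only the cached keys need to be visited.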
In some examples, the update weight Δ is a unit update, i.e., each update weight Δ is either +1 or −1 (Δ∈{−1,+1}). In some such examples, the monitor 110 uses a variant of the foregoing examples where the example comparison of the update weight Δ against an exponential distribution succeeds only when Δ=+1, and does so with probability q=1/τ, where τ is the sampling threshold. In such examples, the count ci for the key i is initialized to zero, ci=0. Further, the count ci is an integer and may be initialized with a value greater than or equal to zero (ci≧0) or uninitialized (when the key is not cached). Accordingly, if a key i is in the cache 240, then the key remains in the cache 240 if a subsequent sampled update associated with the key i does not decrement the count ci (i.e., the subsequent sampled update does not have a weight Δ=−1). Accordingly, “paired off” increments are erased by decrements, leaving the value vi from the “unpaired” increments. In the illustrated example, the key i remains in the cache 240 if it is sampled during an example unpaired increment. Therefore, the example key i has count ci with probability q(1−q)v
In the preceding examples, the distribution of a final count ci depends on vi. For example, the probability that the key i is cached at termination is 1−exp(−vi/τ). In these examples, with the presence of negative updates in the update stream, a key i may be cached at some point during the execution of the instructions 300 of
In the foregoing examples, the probability that the key i is cached at some point during the execution of the instructions 300 of
In some examples, when ΣΔ+(i)<<τ, the probability that a key gets cached is small (approximately ΣΔ+(i)/τ). Accordingly, the probability that the key is cached at termination is ≈vi/τ. Summing over all samples, a worst-case cache utilization (which is observed after all negative updates occur at the end), is the ratio of the sum of positive updates to the sum of values, ΣΔ+(i):vi.
In some examples, the cache controller 250 bounds the cache 240 to a fixed number of k samples. Example approaches disclosed herein bound the number of cached samples to k by effectively increasing the sampling threshold τ for a key i. In some examples, the cache controller 250 increases the sampling threshold τ0 to a new sampling threshold τ1, where τ1>τ0. Assuming a given key i, the process achieves substantially the same distribution as if the sampling threshold had always been τ1 (rather than originally τ0). In the illustrated example, with a probability q=τ0/τ1, no change is made to the count ci; otherwise, the count ci is reduced by a random variable r having an exponential distribution with mean τ1. If the reduced count ci is not positive (ci≦0), the key i is removed from the cache 240.
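As an example, let c≧0, 0≦τ0,τ1, let u be a random variable uniformly distributed in (0,1), and let Expτ1 be an exponential random variable with mean τ1. Then define:
$$\Theta(c,\tau_0,\tau_1) = I(u\tau_1>\tau_0)\,[c-\mathrm{Exp}_{\tau_1}]^+ + I(u\tau_1\le\tau_0)\,c$$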
where Θ(c,τ0,τ1)=c when τ1≦τ0. In this example, the following procedure replaces the count ci with the value Θ(ci,τ0,τ1) if this value is positive. If the value Θ(ci,τ0,τ1) is not positive, the cache controller 250 removes the key i from the cache 240.
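The threshold increase can be sketched as below, assuming the count is kept with probability τ0/τ1 and otherwise reduced by an exponential draw with mean τ1, as the surrounding text describes. The function names `theta` and `raise_threshold` are hypothetical.

```python
import random


def theta(c, tau0, tau1, rng):
    """Resample a count c held under threshold tau0 as if the threshold were tau1."""
    if tau1 <= tau0:
        return c  # threshold not raised: the count is unchanged
    u = rng.random()  # uniform on [0, 1)
    if u * tau1 > tau0:
        # With probability 1 - tau0/tau1, subtract an exponential with mean tau1.
        return max(0.0, c - rng.expovariate(1.0 / tau1))
    return c  # with probability tau0/tau1, keep the count as is


def raise_threshold(counts, tau0, tau1, rng):
    """Apply theta to every cached count; drop keys whose new count is not positive."""
    updated = {}
    for key, c in counts.items():
        new_c = theta(c, tau0, tau1, rng)
        if new_c > 0:
            updated[key] = new_c
    return updated


survivors = raise_threshold({"a": 4.0, "b": 0.2, "c": 7.5}, 2.0, 5.0, random.Random(3))
```

A quick simulation of `theta` with fixed c, τ0, and τ1 should show the average of the adjusted weight (count plus τ1 for surviving keys, zero otherwise) approaching c+τ0, which is the property that keeps the estimate unbiased.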
The sampling threshold τ may be increased according to the example methods disclosed herein using the following procedure:
In the foregoing example, replacing the count ci with the defined random variable Θ preserves unbiasedness. In particular, the distribution of the updated count ci under Θ is substantially equivalent to a fixed-rate Sample and Hold procedure with a sampling threshold τ1.
In the illustrated example, letting 0≦τ0<τ1 and c̃=Θ(c,τ0,τ1) yields:
$$E[I(\tilde{c}>0)(\tilde{c}+\tau_1)\mid c] = c+\tau_0 \tag{1}$$
if c=d[v−Expτ
where E[X] denotes an expectation function. The above estimate (1) is found by calculating the following:
To determine the above distribution (2), let c=d[v−Expτ
W=I(τ1u>τ0)(Expτ
Accordingly, a direct computation of convolution of distributions shows that Expτ
In the illustrated example of
At block 510 of
In the example of
At block 540 of
In the example of
The example machine readable instructions 300, 500 of
In the preceding example, when the cache 240 is not full, a new key i is admitted with a sampling threshold τi equal to zero (τi=0). In such examples, when the cache 240 is full, such that the cache 240 contains k keys, a new key is provisionally admitted (such that the cache 240 now has k+1 keys), and at block 560 one of the k+1 keys in the cache 240 is selected to be removed. In the example above, the procedure E
In the foregoing example, it can be shown by the following that the estimated weight v̂i remains unbiased under the action of the removal procedure described above (E
In the foregoing example, the count c̃i corresponds to the action on the count ci described with respect to increasing the sampling threshold τi, as described herein, and can be found to be:
$$\tilde{c}_i = I(\tau' u_i>\tau_i)\,[c_i-\mathrm{Exp}_{\tau'}]^+ + I(\tau' u_i\le\tau_i)\,c_i \tag{15}$$
where Expτ′ is equivalent to −τ′ log zi. When Ti<τ′, corresponding to the case where {τ′>τi/ui}∩{τ′>ci/(−log zi)}, the key i is selected by the cache controller 250 to be removed from the cache 240. When τ′>τi/ui, the first term in the above expression for c̃i is selected, whereas when τ′>ci/(−log zi), the count c̃i is set to zero (c̃i=0) and the corresponding key i is removed from the cache 240. If Ti≧τ′, the first or second term in the expression for c̃i may be selected, but both cannot be equal to zero.
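A hedged sketch of one removal step is given below. Several details are assumptions inferred from the conditions stated above rather than taken from the text: that Ti = max{τi/ui, ci/(−log zi)}, that the new threshold τ′ is the smallest Ti, and that the key attaining it is the one removed. The function name `eject_one` is hypothetical, and `expovariate(1.0)` stands in for −log zi.

```python
import math
import random


def eject_one(counts, thresholds, rng):
    """Remove one key from an over-full cache and raise the thresholds of the rest.

    counts: key -> count c_i (> 0); thresholds: key -> per-key threshold tau_i.
    Both dictionaries are modified in place. Returns the new threshold tau'.
    """
    draws = {}
    for key, c in counts.items():
        u = 1.0 - rng.random()        # uniform on (0, 1]
        e = rng.expovariate(1.0)      # plays the role of -log z_i
        t_count = c / e if e > 0 else math.inf
        # Assumed T_i: the smallest tau' at which both removal conditions hold.
        draws[key] = (max(thresholds[key] / u, t_count), u, e)

    victim = min(draws, key=lambda k: draws[k][0])
    new_tau = draws[victim][0]        # assumed tau': the smallest T_i
    del counts[victim], thresholds[victim]

    for key in list(counts):
        _, u, e = draws[key]
        if new_tau * u > thresholds[key]:
            # First term of the count expression: c_i - Exp_{tau'}, Exp_{tau'} = tau' * e.
            counts[key] -= new_tau * e
        thresholds[key] = max(thresholds[key], new_tau)
        if counts[key] <= 0:          # guard against floating-point rounding
            del counts[key], thresholds[key]
    return new_tau


counts = {"a": 5.0, "b": 0.5, "c": 2.0}        # k + 1 = 3 keys, capacity k = 2
thresholds = {"a": 1.0, "b": 0.0, "c": 1.0}    # a newly admitted key starts at 0
new_tau = eject_one(counts, thresholds, random.Random(3))
```

Reusing the same draws ui and zi for both the selection and the count adjustment is what lets every surviving key keep a positive count under these assumptions.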
Let the value ṽi denote the estimate of vi based on c̃i=Θ(ci,τi,τ′). Then, from the discussion above
$$E[\tilde{v}_i \mid \tau', c_i>0] = E[I(\tilde{c}_i>0)(\tilde{c}_i+\tau') \mid \tau', c_i>0] = c_i+\tau_i \tag{16}$$
independent of τ′, and, therefore, E[ṽi]=E[I(ci>0)(ci+τi)]=vi.
In some examples, when ci =d [vi−Expτi]+, the unbiased estimate v̂i=I(ci>0)(ci+τi) has variance:
$$\mathrm{Var}[\hat{v}_i] = \tau_i^2\,\big(1-e^{-v_i/\tau_i}\big). \tag{17}$$
Moreover, Var[v̂i] itself has an unbiased estimator si² that does not depend explicitly on the value vi:
$$s_i^2 = I(c_i>0)\,\tau_i^2. \tag{18}$$
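The variance estimator can be checked numerically with a small simulation. This sketch assumes a single key with a fixed threshold τ whose count is distributed as [v − Expτ]+; the closed form τ²(1 − e^(−v/τ)) used for comparison follows from direct integration under that assumption. The function name `simulate_estimates` and the parameter values are hypothetical.

```python
import math
import random


def simulate_estimates(v, tau, trials, rng):
    """Draw counts c = [v - Exp_tau]^+ and return the estimates and variance estimates."""
    estimates, var_estimates = [], []
    for _ in range(trials):
        c = max(0.0, v - rng.expovariate(1.0 / tau))
        estimates.append(c + tau if c > 0 else 0.0)       # adjusted weight
        var_estimates.append(tau ** 2 if c > 0 else 0.0)  # s^2 = I(c > 0) * tau^2
    return estimates, var_estimates


rng = random.Random(1)
v, tau, trials = 4.0, 3.0, 200_000
est, s2 = simulate_estimates(v, tau, trials, rng)
mean_est = sum(est) / trials                                # should be close to v
sample_var = sum((x - mean_est) ** 2 for x in est) / (trials - 1)
mean_s2 = sum(s2) / trials                                  # average variance estimate
closed_form = tau ** 2 * (1.0 - math.exp(-v / tau))
```

Both the sample variance of the estimates and the average of the per-trial variance estimates should land near the closed form, while the mean estimate should land near v.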
In the illustrated example, for a key in the cache 240, uncertainty concerning vi is determined by the estimated unsampled updates, which are exponentially distributed with mean τi and variance τi². Accordingly, both Var[v̂i] and si² are increasing functions of τi.
In some examples, the estimated variance with a given key is non-decreasing while the key i is stored in the cache 240, then drops to zero when the key i is removed from the cache 240, but may increase after further updates for that key i are sampled by the sampler 210. In such examples, because si2 is increasing in τi, an upper bound on the variance is obtained using the maximum threshold τ* encountered over all instances of the keys stored in the cache 240 analyzed during the removal procedure (E
In some examples, an overwrite streams model may be implemented, where an update of the form (i,v) means that the weight associated with key i is updated to v. In this example, the update (i,0) corresponds to a deletion of the key. This example captures the notion of “updating” information about a sample.
In the overwrite streams example, for sampling at a constant rate q=1/τ, if the sampler 210 samples an update for a key i which is already stored in the cache 240, the key i is removed from the cache 240. In some such examples, an independent determination can be made by the cache controller 250 whether to retain the current key i based on its weight v and the sampling threshold τ. In such examples, correctness (i.e. that the sampling is distributed as a Poisson sample on the final values for each key) is immediate. In some examples, a removal procedure similar to E
The processor platform 600 of the instant example includes a processor 612. For example, the processor 612 can be implemented by one or more microprocessors or controllers from any desired family or manufacturer.
The processor 612 includes a local memory 613 (e.g., a cache) and is in communication with a main memory including a volatile memory 614 and a non-volatile memory 616 via a bus 618. The volatile memory 614 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS Dynamic Random Access Memory (RDRAM) and/or any other type of random access memory device. The non-volatile memory 616 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 614, 616 is controlled by a memory controller.
The processor platform 600 also includes an interface circuit 620. The interface circuit 620 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), and/or a PCI express interface.
One or more input devices 622 are connected to the interface circuit 620. The input device(s) 622 permit a user to enter data and commands into the processor 612. The input device(s) 622 can be implemented by, for example, a keyboard, a mouse, a touchscreen, a track-pad, a trackball, isopoint and/or a voice recognition system.
One or more output devices 624 are also connected to the interface circuit 620. The output devices 624 can be implemented, for example, by display devices (e.g., a liquid crystal display, a cathode ray tube display (CRT), and/or speakers). The interface circuit 620, thus, typically includes a graphics driver card.
The interface circuit 620 also includes a communication device (e.g., the data port 204 of
The processor platform 600 also includes one or more mass storage devices 628 for storing software and data. Examples of such mass storage devices 628 include floppy disk drives, hard drive disks, compact disk drives and digital versatile disk (DVD) drives. The mass storage device 628 may implement the cache 240.
The coded instructions 632, which may implement the coded instructions 300, 400, 500 of
From the foregoing, it will be appreciated that example methods, apparatus and articles of manufacture have been disclosed to enable sampling of weighted updates, whether positive or negative.
At least some of the above described example methods and/or apparatus are implemented by one or more software and/or firmware programs running on a computer processor. However, dedicated hardware implementations including, but not limited to, application specific integrated circuits, programmable logic arrays and other hardware devices can likewise be constructed to implement some or all of the example methods and/or apparatus described herein, either in whole or in part. Furthermore, alternative software implementations including, but not limited to, distributed processing or component/object distributed processing, parallel processing, or virtual machine processing can also be constructed to implement the example methods and/or apparatus described herein.
To the extent the above specification describes example components and functions with reference to particular standards and protocols, it is understood that the scope of this patent is not limited to such standards and protocols. For instance, each of the standards for Internet and other packet switched network transmission (e.g., Transmission Control Protocol (TCP)/Internet Protocol (IP), User Datagram Protocol (UDP)/IP, HyperText Markup Language (HTML), HyperText Transfer Protocol (HTTP)) represents an example of the current state of the art. Such standards are periodically superseded by faster or more efficient equivalents having the same general functionality. Accordingly, replacement standards and protocols having the same functions are equivalents which are contemplated by this patent and are intended to be included within the scope of the accompanying claims.
Additionally, although this patent discloses example systems including software or firmware executed on hardware, it should be noted that such systems are merely illustrative and should not be considered as limiting. For example, it is contemplated that any or all of these hardware and software components could be embodied exclusively in hardware, exclusively in software, exclusively in firmware or in some combination of hardware, firmware and/or software. Accordingly, while the above specification described example systems, methods and articles of manufacture, the examples are not the only way to implement such systems, methods and articles of manufacture. Therefore, although certain example methods, apparatus and articles of manufacture have been described herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus and articles of manufacture fairly falling within the scope of the claims either literally or under the doctrine of equivalents.