The invention relates to the field of quantile tracking and, more specifically but not exclusively, to incremental quantile tracking.
Quantiles are useful in characterizing the data distribution of evolving data sets. For example, quantiles are useful in many applications, such as in database applications, network monitoring applications, and the like. In many such applications, quantiles need to be tracked dynamically over time. In database applications, for example, operations on records in the database, e.g., insertions, updates, and deletions, change the quantiles of the data distribution. Similarly, in network monitoring applications, for example, anomalies on data streams need to be detected as the data streams change dynamically over time. Computing quantiles on demand is quite expensive, and, similarly, computing quantiles periodically can be prohibitively costly as well. Therefore, it is desirable to incrementally track quantiles of the data distribution.
Most incremental quantile estimation algorithms are based on a summary of the empirical data distribution, using either a representative sample of the distribution or a global approximation of the distribution. In such incremental quantile estimation algorithms, quantiles are computed from summary data. Disadvantageously, however, in order to obtain quantile estimates with good accuracies (especially for tail quantiles, for which the accuracy requirement tends to be higher than for non-tail quantiles), a large amount of summary information must be maintained, which tends to be expensive in terms of memory. Furthermore, for continuous data streams having underlying distributions that change over time, a large bias in quantile estimates may result since most of the summary information is out of date.
By contrast, other incremental quantile estimation algorithms use stochastic approximation (SA) for quantile estimation, in which the data is viewed as being quantities from a random data distribution. The SA-based quantile estimation algorithms do not keep a global approximation of the distribution and, thus, use negligible memory for estimating tail quantiles. Disadvantageously, however, the existing SA-based quantile estimation algorithms derive each quantile estimate individually, in isolation, which causes problems in incremental quantile estimation. First, derivation of the quantile estimates individually often leads to a violation of the monotone property of quantiles (e.g., such as where the value of the 90% quantile is less than the value of the 80% quantile). Second, although this incremental nature is amenable to continuous data updates, use of derivative information renders the SA-based quantile estimation algorithms sensitive to data order and the particular data distribution during intermediate updates. Third, the existing SA-based quantile estimation algorithms cannot handle dynamic underlying data distributions. These and other issues associated with existing SA-based quantile estimation algorithms present challenges for applications in which incremental quantile tracking is performed.
Various deficiencies in the prior art are addressed via methods, apparatuses, and computer readable mediums for performing incremental quantile tracking of multiple quantiles using stochastic approximation.
In one embodiment, a method for performing an incremental quantile update using a data value of a received data record includes determining an initial distribution function, updating the initial distribution function to form a new distribution function based on the received data value, generating an approximation of the new distribution function, and determining new quantile estimates from the approximation of the new distribution function. The initial distribution function includes a plurality of initial quantile estimates and a respective plurality of initial probabilities associated with the initial quantile estimates. The initial distribution function is updated to form the new distribution function based on the received data value. The new distribution function includes a plurality of quantile points identifying the respective initial quantile estimates and a respective plurality of new probabilities associated with the respective initial quantile estimates. The approximation of the new distribution function is generated by, for each pair of adjacent quantile points in the new distribution function, connecting the adjacent quantile points using a linear approximation of the region between the adjacent quantile points. The new quantile estimates and the new probabilities associated with the new quantile estimates may then be stored.
The teachings herein can be readily understood by considering the following detailed description in conjunction with the accompanying drawings.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
A capability for incremental tracking of quantiles using stochastic approximation (SA), denoted as an SA-based incremental quantile estimation capability, is depicted and described herein. In general, in incremental quantile estimation, the quantiles for a set of data are updated in real or near-real time as data is received, such that the incremental quantile estimation provides a relatively current estimate of the quantiles for the set of data received up to the current time. The SA-based incremental quantile estimation capability enables incremental tracking of multiple quantiles over time, where each of the quantile estimates is updated for each data record that is received, thereby ensuring that, at any given time, the monotone property of quantiles is maintained. The SA-based incremental quantile estimation capability enables incremental tracking of multiple quantiles for different record types, such as insertions, deletions, and updates. The SA-based incremental quantile estimation capability is adaptive to changes in the data distribution. The SA-based incremental quantile estimation capability only needs to track quantiles of interest and, thus, is memory efficient (as opposed to non-SA-based quantile estimation algorithms in which the memory requirements are dependent on which quantile is being estimated, e.g., tail quantiles require more memory).
The SA-based incremental quantile estimation capability incrementally tracks the estimated quantiles of distribution function F(x) using incremental approximations to distribution function F(x) upon receiving new data values. A current data value of a set of received data values {x} is denoted as data value xt received at time t. The SA-based incremental quantile estimation capability updates the approximation to the distribution function F(x) based on received data value xt, such that the quantile estimates are denoted as St=(St(1), St(2), . . . , St(K)) and the probabilities associated with the quantile estimates St are denoted as probabilities pt=(pt(1), pt(2), . . . , pt(K)). A method, according to one embodiment, for tracking the estimated quantiles of distribution function F(x) using an incremental approximation to distribution function F(x) upon new data arrivals is depicted and described with respect to
At step 202, the method 200 begins.
At step 204, an insertion record is received. The insertion record includes a new data value xt. The new data value xt may be any suitable value and may be received in any suitable manner, which may depend, at least in part, on the application for which incremental tracking of estimated quantiles is performed (e.g., receiving a data insertion record for a database, receiving a data value in a data stream in a network, and the like).
At step 206, an initial distribution function (denoted as {circumflex over (F)}t−1 ) is determined.
The initial distribution function {circumflex over (F)}t−1 has properties similar to the distribution function F(x) depicted and described with respect to
In one embodiment, the initial distribution function {circumflex over (F)}t−1 may be the distribution function determined during a previous time (t−1) at which the previous data record was received (e.g., the initial distribution function {circumflex over (F)}t−1 may be the approximation of the new distribution function determined during the previous execution of method 200 at previous time (t−1), where method 200 has already been executed for one or more previously received data records).
At step 208, the initial distribution function {circumflex over (F)}t−1 is updated to form a new distribution function (denoted as {circumflex over (F)}t) based on the new data value xt.
The new distribution function {circumflex over (F)}t includes a plurality of new probabilities (pt(i), 1≦i≦K) associated with the initial quantile estimates St−1(i) of the initial distribution function {circumflex over (F)}t−1.
In one embodiment, the initial distribution function {circumflex over (F)}t−1 is updated to form new distribution function {circumflex over (F)}t by determining the new probabilities pt(i) for the new distribution function {circumflex over (F)}t using pt(i)=(1−wt)pt−1(i)+wtI(St−1(i)≧xt). In this equation, xt is the new data value, wt is a weight associated with the new data value xt (which may be chosen in any suitable manner), St−1(i) are the initial quantile estimates of initial distribution function {circumflex over (F)}t−1, pt−1(i) are the initial probabilities associated with initial quantile estimates St−1(i), I(St−1(i)≧xt) is an indicator function, and i is a counter over the set of quantile estimates and probabilities (1≦i≦K). This equation follows from updating initial distribution function {circumflex over (F)}t−1 as {circumflex over (F)}t(x)=(1−wt){circumflex over (F)}t−1(x)+wtI(x≧xt), evaluating {circumflex over (F)}t(x) at initial quantile estimates St−1(i) at time t−1, and using the fact that {circumflex over (F)}t−1(St−1(i))≈p(i), thereby giving the equation: {circumflex over (F)}t(St−1(i))≈(1−wt)p(i)+wtI(St−1(i)≧xt), which may then be represented as pt(i)=(1−wt)p(i)+wtI(St−1(i)≧xt). The combination of the initial quantile estimates St−1(i) and the new probabilities pt(i) provides a set of quantile points (St−1(i), pt(i)) which defines new distribution function {circumflex over (F)}t.
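The probability update of step 208 can be sketched in a few lines of Python. This is a minimal illustration rather than the claimed implementation; the function name `update_probabilities` and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def update_probabilities(p_prev: Sequence[float],
                         s_prev: Sequence[float],
                         x_t: float,
                         w_t: float) -> List[float]:
    """Apply p_t(i) = (1 - w_t) * p_{t-1}(i) + w_t * I(S_{t-1}(i) >= x_t)."""
    return [
        (1.0 - w_t) * p + w_t * (1.0 if s >= x_t else 0.0)
        for p, s in zip(p_prev, s_prev)
    ]


# Three tracked quantile estimates; the new value 2.5 lies between the 2nd and 3rd.
p_new = update_probabilities([0.25, 0.5, 0.75], [1.0, 2.0, 3.0], x_t=2.5, w_t=0.1)
print(p_new)  # approximately [0.225, 0.45, 0.775]
```

Because the indicator is non-decreasing in i whenever the quantile estimates are sorted, the updated probabilities remain non-decreasing in i.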
At step 210, an approximation of the new distribution function is generated.
In one embodiment, linear interpolation is used to generate the approximation of the new distribution function such that, in the neighborhood of each of the initial quantile estimates St−1(i), the approximation of the new distribution function is a linear function with a slope specified by the respective initial derivative estimates ft−1(i) associated with the initial quantile estimate St−1(i), and the linear points around the initial quantile estimates St−1(i) are extended under the constraints of monotonicity of the interpolation function.
In one embodiment, generating the approximation of the new distribution function includes, for each pair of adjacent quantile points in the new distribution function {circumflex over (F)}t (where each pair of adjacent quantile points includes a first quantile point (St−1(i), pt(i)) and a second quantile point (St−1(i+1), pt(i+1))), performing the following: (1) defining a right quantile point to the right of the first quantile point and a left quantile point to the left of the second quantile point; and (2) generating a linear approximation of the new distribution function for the region between the adjacent quantile points by connecting the first quantile point, the right quantile point, the left quantile point, and the second quantile point in a piecewise linear fashion. In one such embodiment, definition of the right quantile points and the left quantile points is performed using the initial quantile estimates St−1(i), the initial derivative estimates ft−1(i), the new probabilities pt(i), and monotonicity values Δt(i). A more detailed description of one such embodiment is depicted and described with respect to
At step 302, method 300 begins.
At step 304, a counter associated with the quantile points is initialized to one (i=1, 1≦i≦K, where K is the number of estimated quantiles of the new distribution function).
At step 306, a pair of adjacent quantile points is determined. The pair of adjacent quantile points is determined based on the current value of the counter i. The pair of adjacent quantile points includes a first quantile point (St−1(i), pt(i)) and a second quantile point (St−1(i+1), pt(i+1)).
At step 308, a monotonicity value (denoted as Δt(i)) is computed for the pair of adjacent quantile points.
The monotonicity value Δt(i) is computed such that the right quantile point and the left quantile point are non-decreasing, i.e., such that:
[St−1(i)+Δt(i)]≦[St−1(i+1)−Δt(i)], and
[pt(i)+ft−1(i)Δt(i)]≦[pt(i+1)−ft−1(i+1)Δt(i)],
which indicate that:
Δt(i)≦min{[St−1(i+1)−St−1(i)]/2, [pt(i+1)−pt(i)]/[ft−1(i)+ft−1(i+1)]}.
The monotonicity value Δt(i) may be selected in any suitable manner. In one embodiment, for example, the monotonicity value Δt(i) is selected as the maximum possible value determined from the right-hand side of the above equation for monotonicity value Δt(i).
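Rearranging the two inequalities of step 308 bounds Δt(i) by half the gap between the adjacent quantile estimates and by the probability gap divided by the sum of the two derivative estimates. The sketch below selects the largest such value. The function name `monotonicity_value`, the clamp at zero, and the fallback for a non-positive slope sum are assumptions of this sketch, not details taken from the description.

```python
def monotonicity_value(s_i: float, s_next: float,
                       p_i: float, p_next: float,
                       f_i: float, f_next: float) -> float:
    """Largest delta with s_i + delta <= s_next - delta and
    p_i + f_i * delta <= p_next - f_next * delta."""
    bound_x = (s_next - s_i) / 2.0
    slope_sum = f_i + f_next
    if slope_sum <= 0.0:
        # Assumption: with no positive slope the second constraint does not
        # limit delta as long as p_i <= p_next.
        return max(0.0, bound_x)
    bound_y = (p_next - p_i) / slope_sum
    return max(0.0, min(bound_x, bound_y))


delta = monotonicity_value(0.0, 4.0, 0.25, 0.75, 0.5, 0.5)
print(delta)  # 0.5
```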
At step 310, the right quantile point (denoted as rightt(i)) and the left quantile point (denoted leftt(i+1)) are defined.
The right quantile point is a point to the right of the first quantile point, and is defined as follows: rightt(i)=(St−1(i)+Δt(i), pt(i)+ft−1(i)Δt(i)), which is a point in the new distribution function {circumflex over (F)}t that is to the right of the first quantile point (St−1(i), pt(i)) with a slope of ft−1(i).
The left quantile point is a point to the left of the second quantile point, and is defined as follows: leftt(i+1)=(St−1(i+1)−Δt(i), pt(i+1)−ft−1(i+1)Δt(i)), which is a point in the new distribution function {circumflex over (F)}t that is to the left of the second quantile point (St−1(i+1), pt(i+1)) with a slope of ft−1(i+1).
At step 312, the first quantile point, the right quantile point, the left quantile point, and the second quantile point are connected to form a portion of the approximation of the new distribution function. The first quantile point, the right quantile point, the left quantile point, and the second quantile point are connected in a piecewise linear fashion such that the first quantile point is connected to the right quantile point, the right quantile point is connected to the left quantile point, and the left quantile point is connected to the second quantile point.
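Steps 310 and 312 amount to producing four knots per pair of adjacent quantile points. A minimal sketch follows, assuming the monotonicity value has already been chosen; the function name `segment_knots` and the sample numbers are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def segment_knots(s_i: float, s_next: float,
                  p_i: float, p_next: float,
                  f_i: float, f_next: float,
                  delta: float) -> List[Point]:
    """Return [first, right, left, second] for one pair of adjacent quantile points."""
    first = (s_i, p_i)
    right = (s_i + delta, p_i + f_i * delta)            # leaves first with slope f_i
    left = (s_next - delta, p_next - f_next * delta)    # reaches second with slope f_next
    second = (s_next, p_next)
    return [first, right, left, second]


knots = segment_knots(0.0, 4.0, 0.25, 0.75, 0.5, 0.5, delta=0.5)
print(knots)  # [(0.0, 0.25), (0.5, 0.5), (3.5, 0.5), (4.0, 0.75)]
```

In this example the monotonicity value is at its maximum, so the middle piece between the right and left quantile points is flat; a smaller value would give it a positive slope.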
At step 314, a determination is made as to whether counter i is equal to K−1. If the counter i is not equal to K−1, method 300 proceeds to step 316. If the counter i is equal to K−1, method 300 proceeds to step 318.
At step 316, the counter i is incremented by one (i=i+1), and, from step 316, method 300 returns to step 306 so that the process can be repeated for the next pair of adjacent quantile points in the new distribution function {circumflex over (F)}t.
At step 318, the approximation of the new distribution function is extended beyond the two boundary quantile points until it reaches the extreme y-axis values of zero and one (i.e., the approximation of the new distribution function is extended to the left of the quantile point (St−1(1), pt(1)) until it reaches the y-axis value of zero and is extended to the right of quantile point (St−1(K),pt(K)) until it reaches the y-axis value of one).
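The boundary extension of step 318 can be sketched as follows. The description does not state which slope is used for the extension; this sketch assumes the derivative estimates at the two boundary quantile points, ft−1(1) and ft−1(K), and the function name `extend_to_bounds` is hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]


def extend_to_bounds(s_first: float, p_first: float, f_first: float,
                     s_last: float, p_last: float, f_last: float) -> Tuple[Point, Point]:
    """Extend the approximation linearly until it reaches y = 0 and y = 1.

    Assumption: the extension uses the boundary derivative estimates as slopes.
    """
    if f_first <= 0.0 or f_last <= 0.0:
        raise ValueError("boundary slopes must be positive to reach 0 and 1")
    x_zero = s_first - p_first / f_first
    x_one = s_last + (1.0 - p_last) / f_last
    return (x_zero, 0.0), (x_one, 1.0)


low, high = extend_to_bounds(1.0, 0.25, 0.5, 4.0, 0.75, 0.25)
print(low, high)  # (0.5, 0.0) (5.0, 1.0)
```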
At step 320, method 300 ends. Although depicted and described as ending (for purposes of clarity), in an embodiment in which method 300 is used as step 210 of method 200 of
As depicted in
In
The curve functions 410A and 410B represent the hypothetical smooth approximation of the data distribution of new distribution function {circumflex over (F)}t between first quantile point (St−1(1), pt(1)) and second quantile point (St−1(2), pt(2)).
The linear functions 420A and 420B represent the piecewise linear approximations of the new distribution function {circumflex over (F)}t between first quantile point (St−1(1), pt(1)) and second quantile point (St−1(2), pt(2)), determined using first and second quantile points (St−1(1), pt(1)) and (St−1(2), pt(2)), initial derivative estimates ft−1(1) and ft−1(2) associated with first and second quantile points (St−1(1), pt(1)) and (St−1(2), pt(2)), respectively, and monotonicity value Δt(1).
Returning now to
At step 212, new quantile estimates (denoted as St(i)) are determined from the approximation of the new distribution function. The new quantile estimates St(i) are determined from the approximation of the new distribution function as follows: {circumflex over (F)}t(St(i))=p(i).
At step 214, the new quantile estimates St(i) and the new probabilities pt(i) of the approximation of the new distribution function are stored. The new quantile estimates St(i) and the new probabilities pt(i) may be stored in any suitable manner.
In one embodiment, for example, the new quantile estimates St(i) and the new probabilities pt(i) may be stored as respective sets of data values (namely, as a set of new quantile estimates St(i)={St(1), . . . , St(K)} and a set of new probabilities pt(i)={pt(1), . . . , pt(K)}).
In one embodiment, for example, the new quantile estimates St(i) and the new probabilities pt(i) may be stored by storing the approximation of the new distribution function.
The storage of new quantile estimates St(i) and new probabilities pt(i) of the new distribution function enables queries for quantile estimates St(i) to be answered. A method according to one embodiment for responding to queries of quantile estimates using the approximation of the new distribution function is depicted and described with respect to
At step 216, new derivative estimates (denoted as ft(i)) associated with new quantile estimates St(i) are determined.
In one embodiment, new derivative estimates ft(i) may be determined as follows: ft(i)=(1−wt)ft−1(i)+wtI(|xt−St(i)|≦c)/(2c), where c is a tunable parameter representing the window size around each of the new quantile estimates St(i) for which the respective new derivative estimates ft(i) are determined. The window sizes c may be set to any suitable values. In one embodiment, for example, the window sizes c each are a fraction of the estimated inter-quantile range, and the window sizes c are the same for all quantiles. In another embodiment, for example, the values of window sizes c are set such that the window sizes c are not uniform across all quantiles.
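The derivative update of step 216 can be sketched as below. The sketch reads the update as a windowed density estimate, i.e., the indicator of xt falling within c of St(i), divided by the window width 2c; that reading, the function name `update_derivatives`, and the choice of a single window size for all quantiles are assumptions of this sketch.

```python
from typing import List, Sequence


def update_derivatives(f_prev: Sequence[float],
                       s_new: Sequence[float],
                       x_t: float,
                       w_t: float,
                       c: float) -> List[float]:
    """f_t(i) = (1 - w_t) * f_{t-1}(i) + w_t * I(|x_t - S_t(i)| <= c) / (2c)."""
    if c <= 0.0:
        raise ValueError("window size c must be positive")
    return [
        (1.0 - w_t) * f + w_t * (1.0 if abs(x_t - s) <= c else 0.0) / (2.0 * c)
        for f, s in zip(f_prev, s_new)
    ]


f_new = update_derivatives([0.2, 0.4], [1.0, 3.0], x_t=1.2, w_t=0.5, c=0.25)
print(f_new)  # approximately [1.1, 0.2]
```

Only the first quantile estimate has the new value inside its window, so only its derivative estimate receives the 1/(2c) contribution; the second simply decays by the factor (1−wt).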
It will be appreciated that, since the new derivative estimates ft(i) are not required for use in responding to queries for quantile estimates St(i), determining the new derivative estimates may be viewed as an extraneous step performed for purposes of executing method 200 for each received data value. In one embodiment, as depicted in
At step 218, method 200 ends.
Although depicted and described as ending (for purposes of clarity), it will be appreciated that method 200 may be executed for each new insertion record that is received for purposes of incrementally updating quantile estimates.
The SA-based incremental quantile estimation capability depicted and described herein enables incremental tracking of multiple quantiles over time for data with stationary distributions and data with non-stationary distributions. Additionally, the SA-based incremental quantile estimation capability depicted and described herein may utilize multiple types of weights wt in updating the initial distribution function to form the new distribution function. For example, the weights wt may be diminishing (e.g., wt=1/t) or constant (wt=w), or set in any other suitable manner.
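The effect of the two weight choices can be illustrated on the probability update alone. The sketch below is a simplification, not the full method: it tracks the probability at a single fixed point s, whereas the method moves the quantile estimates at every step. The function name `track_probability` and the sample data are hypothetical.

```python
from typing import Callable, Iterable


def track_probability(xs: Iterable[float], s: float,
                      weight: Callable[[int], float],
                      p0: float = 0.0) -> float:
    """Repeatedly apply p <- (1 - w_t) * p + w_t * I(s >= x_t) at a fixed point s."""
    p = p0
    for t, x in enumerate(xs, start=1):
        w = weight(t)
        p = (1.0 - w) * p + w * (1.0 if s >= x else 0.0)
    return p


xs = [1.0, 5.0, 2.0, 7.0]
p_diminishing = track_probability(xs, s=3.0, weight=lambda t: 1.0 / t)
p_constant = track_probability(xs, s=3.0, weight=lambda t: 0.5)
print(p_diminishing)  # approximately 0.5, the fraction of values <= 3.0
print(p_constant)     # 0.3125, recent values count more
```

With wt=1/t the update is a running average, so every data value contributes equally; with a constant weight the contribution of older values decays geometrically, which is why constant weights suit non-stationary data.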
For stationary data (i.e., {circumflex over (F)}t is stationary), simple SA-based algorithms, in which each of the quantile estimates is updated individually in isolation, will lead to convergence for both of the types of weights wt described above. For diminishing weights wt set as wt=1/t, convergence using simple SA-based algorithms is to the true quantile with probability one. For constant weights wt set as wt=w, convergence using simple SA-based algorithms is in distribution to a random variable with mean of the true quantile. These convergence results also are true for the SA-based incremental quantile estimation capability depicted and described herein in which each of the quantile estimates is updated for each received data record. For weights wt set as wt=1/t, as t approaches infinity, the SA-based incremental quantile estimations depicted and described herein will converge to true quantiles. For weights wt set as wt=w, as t approaches infinity, the SA-based incremental quantile estimations depicted and described herein will converge in distribution to a random variable with mean of the true quantile. In one embodiment, for non-stationary data (i.e., {circumflex over (F)}t is non-stationary), the SA-based incremental quantile estimations depicted and described herein will use constant weights (wt=w) as opposed to diminishing weights (wt=1/t).
It will be appreciated that the weights wt used in updating the initial distribution function to form the new distribution function, as depicted and described with respect to
Although primarily depicted and described herein with respect to an embodiment in which estimated quantiles are updated for each new insertion record that is received (i.e., method 200 is executed for each new data value xt that is received), in other embodiments estimated quantiles may be updated using a batch of M insertion records (i.e., a batch of M data values {xt}M). In one such embodiment, steps 204-208 are performed for each of the M data values, and then steps 210-214 are performed once for the batch of M data values using the new distribution function that reflects the M data values. It will be appreciated that method 200 of
The SA-based incremental quantile estimation capability uses an incremental distribution approximation by interpolating at the updated quantile points. As a result, local to the quantile points the incremental distribution approximation is the same linear function as in existing SA-based quantile estimation algorithms in which each quantile point is updated individually in isolation from other quantile points, whereas globally the incremental distribution approximation is an increasing function.
The SA-based incremental quantile estimation capability opens up the possibility of using other more elaborate interpolation or approximation schemes given the local approximations at the quantile points. The SA-based incremental quantile estimation capability also opens up the possibility of using an asymptotic model to overcome some of the instabilities of SA-based incremental quantile estimation schemes in dealing with extreme tails (e.g., due to very small derivatives associated with extreme tails). It will be appreciated that care must be taken to ensure that utilizing such interpolation or approximation schemes does not lead to biases in quantile estimates (e.g., such as where using linear interpolation by connecting quantile points directly without using the local derivatives provides convergence for stationary data, but with a bias).
The SA-based incremental quantile estimation capability enables the updated quantile estimates to be computed relatively efficiently, while at the same time providing good approximations of quantile estimates.
It will be appreciated that, since the distribution approximation is piecewise linear, finding the quantile points of the function for updating (as in step 212) is relatively simple (e.g., by determining which line segment each probability p(i) falls into and then solving p(i) for that line segment).
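The lookup described above can be sketched as a scan over the knots of the piecewise linear approximation. The function name `invert_piecewise_linear`, the list-of-knots representation, and the choice of the left end of a flat segment are assumptions of this sketch.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def invert_piecewise_linear(knots: Sequence[Point], p: float) -> float:
    """Find x with F(x) = p for a non-decreasing piecewise linear F.

    knots holds (x, y) pairs sorted by x with non-decreasing y.
    """
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if y0 <= p <= y1:
            if y1 == y0:
                return x0  # flat segment: take its left end (assumption)
            return x0 + (p - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("p lies outside the range covered by the knots")


knots = [(0.0, 0.25), (0.5, 0.5), (3.5, 0.5), (4.0, 0.75)]
print(invert_piecewise_linear(knots, 0.5))    # 0.5
print(invert_piecewise_linear(knots, 0.625))  # 3.75
```

A linear scan is adequate when only a handful of quantiles are tracked; a binary search over the knot probabilities would serve for larger K.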
It will be further appreciated that the estimated derivative ft is a vector of estimated derivatives (density) and that it is not crucial to obtain exact values of the derivatives. For example, if estimated derivative ft is replaced by a vector of fixed positive constants, the quantile estimates derived using the SA-based incremental quantile estimation capability still provide good approximations; however, it is more efficient to use a value of estimated derivative ft that is close to the actual derivatives of the distribution function since the quantile estimates will stabilize faster around the true value.
Although primarily depicted and described herein with respect to embodiments in which the SA-based incremental quantile estimation capability is utilized for incrementally approximating a distribution function Ft(·) that is a strictly increasing continuous distribution, other embodiments of the SA-based incremental quantile estimation capability may be utilized for incrementally approximating a distribution function Ft(·) that is a discrete distribution. In such embodiments, the SA-based incremental quantile estimation capability may be modified in order to prevent the derivative estimates from becoming infinite. The SA-based incremental quantile estimation capability may be modified in any suitable manner (e.g., by adding a small random noise to the data, where the small random noise may be chosen in a data dependent fashion).
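One way to add small random noise to discrete data is sketched below. The uniform noise distribution, the `scale` parameter, and the function name `jitter` are hypothetical choices; the description leaves the form of the noise open and notes that it may be chosen in a data dependent fashion.

```python
import random


def jitter(x: float, scale: float, rng: random.Random) -> float:
    """Return x plus uniform noise in [-scale, scale] to break ties in discrete data."""
    return x + rng.uniform(-scale, scale)


rng = random.Random(0)  # seeded only to make the sketch repeatable
values = [3, 3, 3, 4, 4]
noisy = [jitter(v, 1e-6, rng) for v in values]
print(noisy)
```

The scale should be small relative to the spacing of the discrete values so that the noise separates ties without materially shifting the quantiles.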
Although primarily depicted and described herein with respect to embodiments in which the SA-based incremental quantile estimation capability is used for a set of data records including only one specific record type (namely, for insertion records), the SA-based incremental quantile estimation capability also may be used for a set of data records including only one specific record type where the one specific record type is different (e.g., using deletion records, update records, and the like) and/or for a set of data records including multiple record types (e.g., using a combination of two or more of insertion records, deletion records, update records, and the like). A description of such embodiments follows.
In one embodiment, the SA-based incremental quantile estimation capability is used for a set of data records including multiple record types. As described herein, the SA-based incremental quantile estimation capability is, in general, based on performing incremental approximations to a distribution function and, thus, the manner depicted and described hereinabove for performing incremental approximations to a distribution function for a set of data records including a single record type (namely, insertion records) is modified to perform incremental approximations to a distribution function for a set of data records including multiple record types. A description of the modification follows.
In this embodiment, assume that the set of data records for which incremental quantile approximation is performed includes insertion records, deletion records, and update records.
In this embodiment, assume that at time t there is always a data value xt inserted, but at the same time there also could be: (1) a data value xt
In this embodiment, let wt be a sequence of intended or initial weights for the insertion data value xt at time t. The weights for the insertion data value xt are deemed to be intended or initial, because the actual weights for the insertion data value xt will be modified due to deletion. For deletion data value xt
In this embodiment, assume that the approximation of the distribution function at time t−1 is denoted as {circumflex over (F)}t−1. Additionally, define total weights value D0=0. The approximation of the distribution function at time t−1 is the initial distribution function {circumflex over (F)}t−1 at time t (similar to step 206 described with respect to
At time t, with the insertion record including insertion data value xt, updating of the initial distribution function {circumflex over (F)}t−1 and the initial total weights value Dt−1 may be represented as follows:
If there are no deletion or update records at time t, the updating of the initial distribution function {circumflex over (F)}t−1 is complete (because no further update of the initial distribution function {circumflex over (F)}t−1 is required at time t).
If there is a deletion record or an update record at time t, the updated distribution function {circumflex over (F)}t that is generated based on the insertion record is further updated to account for the deletion or insertion.
At time t, if there is a deletion record indicating deletion of data value xt
where dt
At time t, if there is an update record indicating update of data value xt
It will be appreciated from these update equations that an update record is treated as a combination of a deletion record and an insertion record for time t (i.e., the data value to be updated is deleted and replaced with the new value).
In the above-defined equations for insertion, deletion, and update records, the total weights value Dt represents the total of all weights from data values deleted at time t. As such, the total weights of data that contributed to updated distribution function {circumflex over (F)}t at time t is not one, but, rather, is 1−Dt due to deletions.
For the insertion equations, with the arrival of new data value xt, the updated distribution function {circumflex over (F)}t is the weighted sum of I(x≧xt) from insertion data value xt with weight wt, and initial distribution function {circumflex over (F)}t−1 with weight (1−wt)(1−Dt−1), normalized to have a total weight of one. Additionally, the weight of the deleted data in {circumflex over (F)}t is updated by a factor of (1−wt).
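The insertion step with a running deleted weight can be sketched from the verbal description above, evaluated at the initial quantile estimates. This is one reading of that description and not a reproduction of the insertion equations themselves; the function name `insert_with_deleted_weight` and the sample numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def insert_with_deleted_weight(p_prev: Sequence[float],
                               s_prev: Sequence[float],
                               x_t: float,
                               w_t: float,
                               d_prev: float) -> Tuple[List[float], float]:
    """Weight the inserted value by w_t and the old function by
    (1 - w_t) * (1 - d_prev), normalize to total weight one, and scale the
    deleted weight by (1 - w_t)."""
    old_weight = (1.0 - w_t) * (1.0 - d_prev)
    total = old_weight + w_t  # equals 1 - (1 - w_t) * d_prev
    if total <= 0.0:
        raise ValueError("total weight must be positive")
    p_int = [
        (old_weight * p + w_t * (1.0 if s >= x_t else 0.0)) / total
        for p, s in zip(p_prev, s_prev)
    ]
    d_int = (1.0 - w_t) * d_prev
    return p_int, d_int


p_int, d_int = insert_with_deleted_weight([0.5], [2.0], x_t=1.0, w_t=0.5, d_prev=0.2)
print(p_int, d_int)  # approximately [0.7778] 0.1
```

With no deleted weight (d_prev=0) the normalizing total is one and the update reduces to pt(i)=(1−wt)pt−1(i)+wtI(St−1(i)≧xt), the insertion-only case.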
As described hereinabove, from the above-described equations, the equations adapted for use in updating the initial probabilities pt−1(i) to form the new probabilities pt(i) may be derived. Namely, the equations adapted for use in updating the initial probabilities pt−1(i) to form the new probabilities pt(i) may be derived by evaluating the new distribution function {circumflex over (F)}t at each of the initial quantile estimates St−1(i) at time t−1.
The initial probabilities pt−1(i) are updated to form the new probabilities pt(i) (similar to step 208 described with respect to
At time t, with the insertion record including insertion data value xt: (a) the initial probabilities pt−1(i) are updated to form intermediate probabilities ptINT(i) and (b) the initial total weights value Dt−1 is updated to form an intermediate total weights value DtINT, as follows:
If there are no deletion or update records at time t, the intermediate probabilities ptINT(i) are denoted as new probabilities pt(i) (because no further update of the probabilities is required at time t).
If there is a deletion record or an update record at time t, the intermediate probabilities ptINT(i) are further updated, based on the deletion or update record, in order to determine new probabilities pt(i).
At time t, if there is a deletion record indicating deletion of data value xt
where dt
At time t, if there is an update record indicating update of data value xt
Update: pt(i)←(dt
As described herein, the single-record-type case for incrementally tracking estimated quantiles of a data distribution (depicted and described with respect to
As depicted in
At step 510, the initial probabilities pt−1(i) associated with the initial quantile estimates St−1(i) of initial distribution function {circumflex over (F)}t−1 are updated to form intermediate probabilities ptINT(i) and the initial total weights value Dt−1 is updated to form an intermediate total weights value DtINT. The intermediate probabilities ptINT(i) and intermediate total weights value DtINT are determined as follows:
At step 520, a determination is made as to whether a deletion record or an update record has been received along with the insertion record. If neither a deletion record nor an update record has been received (i.e., only an insertion record was received at time t), method 500 proceeds to step 530. If a deletion record was received at time t, method 500 proceeds to step 540. If an update record was received at time t, method 500 proceeds to step 550.
At step 530, since only an insertion record was received at time t, the intermediate probabilities ptINT(i) determined in step 510 become the new probabilities pt(i) associated with initial quantile estimates St−1(i) to form thereby new distribution function {circumflex over (F)}t, and the intermediate total weights value DtINT determined in step 510 becomes the new total weights value Dt.
At step 540, since a deletion record was received in addition to the insertion record: (a) the intermediate probabilities ptINT(i) determined in step 510 are updated again to become the new probabilities pt(i) associated with initial quantile estimates St−1(i) to form thereby new distribution function {circumflex over (F)}t, and (b) the intermediate total weights value DtINT determined in step 510 is updated again to become new total weights value Dt. The new probabilities pt(i) and new total weights value Dt are determined as follows:
where dt
At step 550, since an update record was received in addition to the insertion record, the intermediate probabilities ptINT(i) that were determined in step 510 are updated again to become the new probabilities pt(i) associated with initial quantile estimates St−1(i) to form thereby new distribution function {circumflex over (F)}t. As described hereinabove, the intermediate probabilities ptINT(i) are updated based on the update record as follows:
Update: pt(i)←(dt
As depicted in
Although primarily depicted and described herein with respect to an embodiment in which the extended version of the SA-based incremental quantile estimation capability supports a set of data records that includes insertion records, deletion records, and update records, other embodiments of the extended version of the SA-based incremental quantile estimation capability may support sets of data records that include other types and/or combinations of records (e.g., where the set of data records includes insertion records and deletion records, where the set of data records includes insertion records and update records, and the like). In one embodiment, the types of records that are included in the set of data records for which the SA-based incremental quantile estimation capability is implemented may be dependent on the application for which the SA-based incremental quantile estimation capability is used (e.g., database applications, networking applications, and the like).
The SA-based incremental quantile estimation capability depicted and described herein for multiple-record-type implementations may utilize multiple types of weights wt in updating the initial distribution function to form the new distribution function. For example, the weights wt may be diminishing (e.g., wt=1/t) or constant (wt=w), or set in any other suitable manner.
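By way of illustration only, the effect of the two weighting choices may be sketched for an insertion-only stream. The sketch assumes a recursive update of the form F_t = (1 − w_t)·F_{t−1} + w_t·I, which is an assumption made for this example and not a restatement of the equations herein; the function name effective_weights is hypothetical.

```python
from typing import Callable, List


def effective_weights(weight_at: Callable[[int], float], T: int) -> List[float]:
    """Weight carried at time T by each insertion s = 1..T.

    Assumes the insertion-only recursion F_t = (1 - w_t) * F_{t-1} + w_t * I,
    so the value inserted at time s carries w_s * prod_{u=s+1..T} (1 - w_u).
    """
    weights = []
    for s in range(1, T + 1):
        carried = weight_at(s)
        for u in range(s + 1, T + 1):
            carried *= 1.0 - weight_at(u)
        weights.append(carried)
    return weights


# Diminishing weights w_t = 1/t: every inserted value ends up weighted equally.
diminishing = effective_weights(lambda t: 1.0 / t, 5)

# Constant weights w_t = w: older values decay geometrically.
constant = effective_weights(lambda t: 0.1, 4)
```

Under this assumption, diminishing weights treat all inserted values alike, whereas constant weights emphasize recent values, which may be preferable when the underlying distribution changes over time.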
For diminishing weights wt set as wt=1/t, it will be appreciated that Dt is the ratio of deletes in the data. Assuming that this is true for t−1, and further assuming that there are k deletions, then, with the arrival of insertion data value xt, by Equation 16, {circumflex over (F)}t(x) is the weighted sum of {circumflex over (F)}t−1(x) and I(x≧xt) with weights (t−k−1)/(t−k) and 1/(t−k), and Dt=k/(t+1) is actually the ratio of deletes in the data up to time t. It also will be appreciated that this may be verified for the deletion and update equations (Equations 17 and 18). In one such embodiment, the actual weight given to xt is 1/(t−k), not the intended weight 1/t.
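As a worked check of the weights stated for the diminishing case, the following sketch evaluates (t−k−1)/(t−k) and 1/(t−k) exactly for chosen values of t and k and compares the latter with the intended weight 1/t. The function name insertion_weights_with_deletes and the sample values t=10, k=2 are illustrative assumptions only.

```python
from fractions import Fraction
from typing import Tuple


def insertion_weights_with_deletes(t: int, k: int) -> Tuple[Fraction, Fraction]:
    """Return the weights on the previous distribution function and on the
    indicator term, (t-k-1)/(t-k) and 1/(t-k), for k deletions before time t."""
    if not 0 <= k < t:
        raise ValueError("require 0 <= k < t")
    remaining = t - k  # number of values not deleted, counting the new insertion
    return Fraction(remaining - 1, remaining), Fraction(1, remaining)


# Illustrative values: t = 10 insertions so far, k = 2 of them deleted.
w_prev, w_new = insertion_weights_with_deletes(10, 2)
intended = Fraction(1, 10)
# w_prev is 7/8 and w_new is 1/8; the two weights sum to 1,
# and w_new exceeds the intended weight 1/10 whenever k > 0.
```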
For constant weights wt set as wt=w (where w is positive), let s1<s2< . . . <sk be the indices of the data that are deleted until time t, where k is the total number of deletes before time t. With the arrival of insertion data value xt, it can be shown that Dt=(1−w)t−s
It will be appreciated that the weights wt used in updating the initial distribution function to form the new distribution function may be set in any other suitable manner.
With respect to the SA-based incremental quantile estimation capability depicted and described herein for multiple-record-type implementations, in the case of deletions and updates for stationary data that will result in equilibrium (for example, when the deletes occur at a lag with a stationary random distribution), the estimated quantiles converge to the true quantiles. A heuristic understanding of this convergence is that the insertion, deletion, and update equations depicted and described herein are designed in such a way that the effect of deleted data is diminished in the functional approximation of {circumflex over (F)}t(x), and thus the quantile estimates will be correct for the remaining data.
It will be appreciated that the modified/additional embodiments that are described with respect to the single-record-type implementations of the SA-based incremental quantile estimation capability also apply to the multiple-record-type implementations of the SA-based incremental quantile estimation capability (e.g., batch processing of insertion records, support for both continuous and discrete distribution functions, and the like, as well as various combinations thereof).
At step 602, method 600 begins.
At step 604, a quantile query request is received.
The quantile query request may be any quantile query request. For example, the quantile query request may be a request for a quantile for a specific value or a request for a quantile for a range of values (e.g., for a portion of a bin, multiple bins, a range of values spanning multiple bins, and the like, as well as various combinations thereof).
The quantile query request may be received from any source. For example, the quantile query request may be received locally at the system performing incremental quantile estimation, received from a remote system in communication with the system performing incremental quantile estimation, and the like, as well as various combinations thereof.
The quantile query request may be initiated in any manner. For example, the quantile query request may be initiated manually by a user, automatically by a system, and the like, as well as various combinations thereof.
At step 606, a quantile query response is determined using a distribution function. As described herein, the distribution function is being updated in real time or near real time as data values are being received and, thus, the distribution function provides an accurate estimate of the current view of the quantile distribution. Thus, the quantile query response provides a current value of the quantile of the data value(s) for which the quantile query request was initiated.
At step 608, method 600 ends.
Although primarily described herein such that the distribution functions are said to include a plurality of quantile estimates and an associated plurality of probabilities, it will be appreciated by those skilled in the art and informed by the teachings herein that the distribution functions also may be said to be represented by a plurality of quantile estimates and an associated plurality of probabilities (as well as the derivative estimates associated with the quantile estimates).
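By way of illustration only, a quantile query response may be computed from a distribution function represented by quantile estimates S(i) and associated probabilities p(i). The sketch below assumes linear interpolation between adjacent (S(i), p(i)) pairs and clamping outside the tracked range; both are assumptions made for this example, and the function names are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence


def _interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Piecewise-linear interpolation of ys over nondecreasing xs, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_left(xs, x)  # first index with xs[j] >= x, so xs[j-1] < x <= xs[j]
    x0, x1 = xs[j - 1], xs[j]
    fraction = (x - x0) / (x1 - x0)
    return ys[j - 1] + fraction * (ys[j] - ys[j - 1])


def quantile_for_probability(S: Sequence[float], p: Sequence[float], q: float) -> float:
    """Estimated data value at probability q (e.g., q = 0.9 for the 90% quantile)."""
    return _interpolate(p, S, q)


def probability_for_value(S: Sequence[float], p: Sequence[float], x: float) -> float:
    """Estimated probability (quantile level) at data value x."""
    return _interpolate(S, p, x)


# Illustrative state: three tracked quantile estimates and their probabilities.
S = [10.0, 20.0, 30.0]
p = [0.25, 0.50, 0.75]
median_estimate = quantile_for_probability(S, p, 0.50)
level_of_25 = probability_for_value(S, p, 25.0)
```

Because the query reads only the current S(i) and p(i), it may be answered at any time without revisiting previously received data values.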
Although depicted and described as ending (for purposes of clarity), it will be appreciated that method 600 of
It should be noted that the present invention may be implemented in software and/or in a combination of software and hardware, e.g., using application specific integrated circuits (ASIC), a general purpose computer, or any other hardware equivalents. In one embodiment, the incremental quantile estimation process 705 can be loaded into memory 704 and executed by processor 702 to implement the functions as discussed above. As such, the incremental quantile estimation process 705 (including associated data structures) of the present invention can be stored on a computer readable medium or carrier, e.g., RAM memory, magnetic or optical drive or diskette, and the like.
It is contemplated that some of the steps discussed herein as software methods may be implemented within hardware, for example, as circuitry that cooperates with the processor to perform various method steps. Portions of the functions/elements described herein may be implemented as a computer program product wherein computer instructions, when processed by a computer, adapt the operation of the computer such that the methods and/or techniques described herein are invoked or otherwise provided. Instructions for invoking the inventive methods may be stored in fixed or removable media, transmitted via a data stream in a broadcast or other signal bearing medium, and/or stored within a memory within a computing device operating according to the instructions.
Although various embodiments which incorporate the teachings of the present invention have been shown and described in detail herein, those skilled in the art can readily devise many other varied embodiments that still incorporate these teachings.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 61/224,704, filed Jul. 10, 2009, entitled “INCREMENTAL TRACKING OF MULTIPLE QUANTILES” which is hereby incorporated by reference herein in its entirety. This application is related to U.S. patent application Ser. No. 12/546,344, filed Aug. 24, 2009, entitled “METHOD AND APPARATUS FOR INCREMENTAL QUANTILE TRACKING OF MULTIPLE RECORD TYPES,” which is hereby incorporated by reference herein in its entirety.
Number | Date | Country | |
---|---|---|---|
20110010327 A1 | Jan 2011 | US |
Number | Date | Country | |
---|---|---|---|
61224704 | Jul 2009 | US |