The disclosure concerns the field of application monitoring and observability within the broad field of information technology. In particular, the disclosure concerns the forecasting of time series of application monitoring data, observability data, and data observability data by quantile regression. The forecast data can be used for various purposes, e.g., for anomaly detection of reported signals, predictive capacity management, behavior prediction, pattern prediction etc.
Application performance monitoring (APM) and observability software allows users, such as IT operators, site reliability engineers, cloud and platform operators, application developers, and product owners, to observe and analyze application health, performance, and user experience. Such APM and observability software may be self-hosted software, vendor-managed hosted software, or software as a service (SaaS). In order to monitor the performance of applications, computers, or entire computing environments, massive amounts of data are logged, such as log lines, traces from applications, metrics, etc. Such data, here referred to as application monitoring and observability data, supplemented by its temporal dimension (i.e. the date and time when a data signal was created or logged), constitutes time series data.
Two examples describing the state of the art in capacity management of hard disk drives (HDD) are shown in
The basic idea underlying the disclosure is to use historical time series of application monitoring and observability data to forecast or predict the development of that data in the future.
Various methods exist to forecast time series data into the future. The most commonly used method for predicting time series data is least squares regression, i.e. minimizing the mean square error between the actual values of the time series and the values predicted by the regression.
Despite being widely used today, investigations by the applicant have shown that least squares regression is quite unsuitable for robustly forecasting application monitoring and observability data. This is due to several reasons: First, predictions of time series data by least squares regression are sensitive to both noise and outliers. In other words, a single outlier or a few outliers can have a significant influence on the predicted data, making the prediction potentially unreliable. Second, least squares regression is unsuitable for highly correlated inputs. In other words, the time series is only predicted sufficiently accurately if i) the data lacks both noise and outliers and ii) the input data is uncorrelated to a high degree. Put differently, the forecasting of time series of application monitoring and observability data is robust only when certain assumptions are fulfilled. How the robust forecasting of time series of application monitoring and observability data can be improved without having to make assumptions about the underlying data is not known.
This section provides background information related to the present disclosure which is not necessarily prior art.
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
The object of the disclosure is to find i. a method for anomaly detection of application monitoring data, observability data, and data observability data, and ii. a method for forecasting a time series of application monitoring data, observability data and data observability data. These methods shall be robust against both noise and outliers and shall work reliably for highly correlated time series. In addition, the methods shall be efficient in terms of CPU load and memory consumption.
The first aspect of the disclosure is solved by a computer-implemented method for detecting an anomaly in application monitoring data in a distributed computing environment, comprising:
Based on the training time series data for a given variable, multiple, i.e. at least two, preferably at least three, probabilistic models are trained to forecast the quantitative development of the time series in the future. The training time series data is used to train the models such that a first model of the multiple models can predict at least one lower quantile, e.g. the 10% quantile, of the time series and a second model of the multiple models can predict at least one higher quantile, e.g. the 90% quantile, of the time series. The probabilistic models are assumed to be of n-th order, e.g. considering n autoregressive features in order to compute a forecast value. After having received, by a computer processor, time series data for the given variable, y, in the distributed computing environment and extracting, by the computer processor, the last n values of the time series data, (y0, . . . , yn−1), for the given variable, the models are used to predict the future development of the time series by calculating at least one lower quantile forecast and at least one higher quantile forecast for the time series. Depending on the lower quantile forecast and the higher quantile forecast, a lower threshold and a higher threshold are calculated. After having received a new value, yn, and comparing the new value to the lower and higher thresholds, yn is marked if yn is greater than the higher threshold, i.e. yn>yhighT, or if yn is smaller than the lower threshold, i.e. yn<ylowT.
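By way of illustration, a minimal Python sketch of this per-time-step check is given below. The function and variable names are not taken from the disclosure, the two models are assumed to be linear with a trailing intercept entry, and the threshold rule shown (extending the forecast interval by its own spread) is only one of the strategies discussed in the following paragraph.

import numpy as np

def detect_anomaly(history, y_new, beta_low, beta_high, spread_factor=1.0):
    # Flag y_new if it falls outside thresholds derived from the lower and
    # higher quantile forecasts of two linear quantile models.
    # beta_low / beta_high: parameter vectors, last entry = intercept.
    n = len(beta_low) - 1                      # model order (number of autoregressive lags)
    x = np.append(history[-n:][::-1], 1.0)     # last n values, newest first, plus intercept
    q_low = float(np.dot(x, beta_low))         # lower quantile forecast
    q_high = float(np.dot(x, beta_high))       # higher quantile forecast
    spread = q_high - q_low
    low_thr = q_low - spread_factor * spread   # one possible threshold strategy
    high_thr = q_high + spread_factor * spread
    return y_new < low_thr or y_new > high_thr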
The lower threshold can be equal to or different from the lower quantile forecast. Likewise, the higher threshold can be equal to or different from the higher quantile forecast. In one simple approach, the difference between the lower and higher quantile forecasts defines the difference between the lower quantile forecast and the lower threshold, such that the lower threshold < the lower quantile forecast. In an analogous manner, the difference between the lower and higher quantile forecasts can also define the difference between the higher quantile forecast and the higher threshold, such that the higher threshold > the higher quantile forecast. Typically, there is a functional relationship between the lower quantile forecast, the higher quantile forecast, etc. as inputs and the lower and higher threshold values as outputs. Different strategies exist and can be adapted as required.
In the next time step, n+1, the new value, yn, of the given variable, y, is appended by the computer processor to the end of the time series. After this, another time series, called a next time series, (y1, . . . , yn), is extracted from the time series for the given variable, y, where the next time series comprises the last n values of the time series. The next time series is used to calculate a lower quantile forecast using the first probabilistic model and a higher quantile forecast using the second probabilistic model. Based on the lower and higher quantile forecasts, a low threshold for the given variable and a high threshold for the given variable are calculated. After having received a new next value, yn+1, for the given variable, the next new value is compared to the low threshold and to the high threshold. The computer processor marks the next new value as an anomaly if the next new value is either less than the low threshold or greater than the high threshold.
According to a preferred embodiment, calculating the low and high thresholds comprises: determining a minimum value, e.g. a 0% quantile value, for the given variable using the lower quantile forecast and the higher quantile forecast and determining a maximum value, e.g. a 100% quantile value, for the given variable using the lower quantile forecast and the higher quantile forecast, thereby defining a range for the given variable; generating random numbers evenly distributed between 0 and 1; for each random number, mapping the given random number to a corresponding value in the range, thereby generating a set of values for the given variable; and determining the low threshold and the high threshold from the set of values for the given variable.
In this way, multiple random numbers, e.g. N random numbers evenly distributed between 0 and 1, are mapped to the range for the given variable, such that the mapped N numbers represent a distribution in the range. The low and high threshold values are typically quantile values, e.g. a 10% quantile for the low threshold and a 90% quantile for the high threshold.
Mapping a given random number to a corresponding value in the range is for example done by linear interpolation.
According to a typical case, the low threshold is set to a quantile value lower than the lower quantile value and the high threshold is set to a quantile value higher than the higher quantile value.
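A hedged Python sketch of this sampling-based threshold derivation is given below. It assumes three forecast quantiles (25%, 50% and 75%, as in the application example later on) and a symmetric extension to virtual 0% and 100% quantiles; the names and the use of numpy's quantile function are illustrative, not part of the disclosure.

import numpy as np

def thresholds_by_sampling(q_low, q_med, q_high, tau_low=0.10, tau_high=0.90,
                           n_samples=1000, rng=None):
    # Extend the forecast quantiles to an assumed virtual 0%..100% range,
    # sample it with uniform random numbers mapped by linear interpolation,
    # and read off the low/high thresholds as quantiles of the sampled values.
    rng = rng or np.random.default_rng()
    q0 = q_low - (q_med - q_low)        # assumed virtual 0% quantile (minimum)
    q100 = q_high + (q_high - q_med)    # assumed virtual 100% quantile (maximum)
    u = rng.uniform(0.0, 1.0, n_samples)
    samples = np.interp(u, [0.0, 0.25, 0.5, 0.75, 1.0],
                        [q0, q_low, q_med, q_high, q100])
    return np.quantile(samples, tau_low), np.quantile(samples, tau_high)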
According to one simple embodiment, at least one of the first probabilistic model and the second probabilistic model is a linear regression model.
According to a very preferred embodiment, training at least one of the first probabilistic model and the second probabilistic model uses the Regularized Smoothed Iterative Least Squares (RSILS) method, which minimizes the pinball loss for the probabilistic models. The RSILS algorithm works best if the training time series comprises 1000 or more values.
According to another very preferred embodiment, training at least one of the first probabilistic model and the second probabilistic model uses the Coordinate Descent (CD) method. The CD algorithm works best if the training time series comprises fewer than 1000 values.
In some typical cases, the lower quantile and the higher quantile are symmetrical in relation to the median.
The first variant of the second aspect of the disclosure is solved by a computer-implemented method for forecasting application monitoring data in a distributed computing environment, comprising:
According to the first variant for forecasting a time series of application monitoring data, observability data and data observability data, at least two probabilistic models, i.e. a first model being able to predict lower quantile values and a second model being able to predict higher quantile values for the given variable, preferably three probabilistic models, i.e. a first and a second model as above and a third model being able to predict median values for the given variable, are trained from training time series data. The probabilistic models are assumed to be of n-th order with n>1, e.g. considering n autoregressive features in order to compute a forecast value. After having received, by a computer processor, time series data for the given variable, y, in the distributed computing environment and extracting, by the computer processor, the last n values of the time series data, (y0, . . . , yn−1), for the given variable, the models are used to predict the future development of the time series by calculating at least one lower quantile forecast and at least one higher quantile forecast for the time series. In forecasting, not just a single time step but typically many time steps into the future are computed. According to the 1st variant, multiple paths of forecasts are calculated one after the other. Having received time series data for the given variable, y, i) the last n data points are extracted from the time series data, constituting the time series for the 1st step. Based on the time series, the computer processor ii) calculates a lower quantile forecast using the first probabilistic model and iii) a higher quantile forecast using the second probabilistic model. Based on the lower and higher quantile forecasts, iv) minimum and maximum values for the given variable are determined, thereby defining a range for the given variable. Subsequently, v) a random number evenly distributed between zero and one is generated, and vi) a forecast value for the given variable is generated by mapping the random number to a corresponding value in the range. The forecast value is then vii) appended to the end of the time series, and i) another time series is extracted using the last n data points from the time series. Steps i) to vii) are repeated until the path of forecasts reaches a prediction horizon, e.g. 100 time steps into the future.
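The following Python sketch illustrates the path-based loop of steps i) to vii). It assumes three trained linear quantile models (25%, 50% and 75%, as in the application example below), an intercept as last model parameter, and the symmetric extension to virtual minimum and maximum values; the function names are not part of the disclosure.

import numpy as np

def forecast_paths(series, beta_low, beta_med, beta_high, horizon=100,
                   num_paths=100, rng=None):
    # Each path repeats steps i)-vii) until the prediction horizon is reached.
    rng = rng or np.random.default_rng()
    n = len(beta_low) - 1                                   # model order (autoregressive lags)
    paths = np.empty((num_paths, horizon))
    for p in range(num_paths):
        ts = list(series)
        for t in range(horizon):
            x = np.append(np.asarray(ts[-n:])[::-1], 1.0)   # i) last n values plus intercept
            q_low = float(np.dot(x, beta_low))              # ii) lower quantile forecast
            q_med = float(np.dot(x, beta_med))
            q_high = float(np.dot(x, beta_high))            # iii) higher quantile forecast
            q_min = q_low - (q_med - q_low)                 # iv) assumed virtual minimum
            q_max = q_high + (q_high - q_med)               #     and virtual maximum
            u = rng.uniform(0.0, 1.0)                       # v) random number in [0, 1]
            y_hat = float(np.interp(u, [0, 0.25, 0.5, 0.75, 1.0],
                                    [q_min, q_low, q_med, q_high, q_max]))  # vi) map into range
            ts.append(y_hat)                                # vii) append and repeat
            paths[p, t] = y_hat
    return paths

The quantiles reported per future time step can then be taken across the paths, e.g. np.quantile(paths, [0.05, 0.5, 0.95], axis=0).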
The multiple forecast values at the individual time steps are used to identify quantile values in the multiple paths of forecasts, and these quantile values are reported to the method customer.
The second variant of the second aspect of the disclosure is solved by a computer-implemented method for forecasting application monitoring data in a distributed computing environment, comprising:
The difference between the 1st variant and the 2nd variant is that in the former case multiple paths of forecasts are computed until the end of the prediction horizon, and the values of the multiple paths are used to identify quantile values at individual time steps, whereas in the latter case multiple random numbers are generated at each time step and multiple forecast values are obtained by mapping these numbers to a range, e.g. between a virtual 0% and 100% quantile. The forecast values at a certain time step can then be used to derive the quantile values at that time step.
In contrast to the method for anomaly detection, two, preferably at least three, models are trained to predict a lower quantile value, a median value, and an upper quantile value of the time series. Another difference between the methods for anomaly detection and forecasting is that the development of the time series is predicted not just for one time step, but for at least three time steps, typically many more.
In the methods for forecasting, the calculated lower and higher quantile forecasts are used to perform statistical experiments by randomly sampling the range between a minimum value (e.g. the virtual 0% quantile) and a maximum value (e.g. the virtual 100% quantile) of the forecasts. Based on the lower quantile forecast for τ=τlow and the higher quantile forecast for τ=τhigh, the virtual 0% and 100% quantiles can be calculated. In every statistical experiment, an evenly distributed random number between 0 and 1 is generated, which is mapped to the range between the virtual 0% and 100% quantiles by linear interpolation. The mapped value of the random number constitutes one random forecast. The random sampling is performed multiple times, and one or more quantiles can be calculated based on the multiple sampled forecasts. Note that the quantiles derived from random sampling can be different from the quantiles of the multiple probabilistic models.
In the fields mentioned above, the presented methods can be utilized e.g. for predictive capacity management, behavior prediction as a means of training a model to predict the behavior of an entity, such as a service in production, or pattern prediction as a possibility to predict the occurrence of patterns, such as log lines, events or problems based on historic behavior.
It is advantageous if the probabilistic models consider at least one of i) autoregressive features in the time series data (e.g. to model local time dependencies and to react to local deviations from seasonality), ii) time and calendar features in the time series data (e.g. to model patterns depending on weekday, time of week, time of day), and iii) seasonal patterns in the time series data (e.g. to find seasonal patterns in the time series data occurring monthly, weekly, daily, hourly, or minutely, or combinations thereof, by robust versions of the Fourier Transformation).
Examples of models having autoregressive features are given in the application examples. SARIMA (Seasonal Autoregressive Integrated Moving Average) and SARIMAX (Seasonal ARIMA with exogenous regressors) are just two examples of models considering seasonality features.
In many real-world applications, the stop criterion is a temporal prediction horizon TPH, e.g. that the number of predictions n≥TPH.
In some cases it is beneficial if the quantiles of the first and the third model are defined symmetrically with respect to the 0.5 quantile, e.g. 0.1 and 0.9, or 0.25 and 0.75.
It may be beneficial if before the start of the training of a probabilistic model a data validation is performed, the data validation checking whether the time series data is sufficiently large for forecasting and/or the number of missing values in the historical data is sufficiently low.
After the data validation step and before the start of the training of the probabilistic models it may be expedient to decide which algorithm shall be used to solve the optimization problem for ŷ=X·β.
Generally it is expedient to use the Regularized Smoothed Iterative Least Squares (RSILS) algorithm if the time series comprises 1000 or more values.
Conversely, it is expedient to use the Coordinate Descent (CD) algorithm if the time series comprises less than 1000 values.
The probabilistic models can be updated to the latest developments by repeating the training of the probabilistic models periodically, such as daily, weekly, monthly or multiples of these.
Typically, the quantiles found by performing the statistical experiments are output to a customer, which may be a person, a piece of software etc.
In some cases, it may be expedient to use the quantiles found by performing the statistical experiments on the linearly interpolated numbers as the lower quantile forecast and the higher quantile forecast in anomaly detection.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The accompanying drawings, which are incorporated in and constitute part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure. The embodiments illustrated herein are presently preferred, it being understood, however, that the disclosure is not limited to the precise arrangements and instrumentalities shown, wherein:
Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
A first application example shows how training time series data is used to train multiple probabilistic models and how the trained models are able to calculate multiple numerical forecasts predicting the development of the time series in the future. The forecasts can be used for i) making decisions in the field of application monitoring and observability of computer environments, such as capacity or resource planning, or ii) for anomaly detection.
The time series of application monitoring and observability data is given by the equations
In order to make the time series more irregular, or in other words, less predictable, normally distributed noise is added to y, where the noise has a mean value of 1. The time series y represents e.g. changes in the HDD capacity or CPU load of a computer. The unit of time t can be e.g. hours, minutes, or multiples of it.
The code to calculate the time series y in the well-known programming language Python is given below:
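The original listing and the equations of the example are not reproduced in this text; the sketch below therefore uses a placeholder sinusoid. Only the noise model (normally distributed with a mean value of 1), the rounding to two decimals and the 2000/100 split of the 2100 values are taken from the description; amplitude, period, noise scale and seed are assumptions.

import numpy as np

rng = np.random.default_rng(42)
t = np.arange(2100)

# placeholder seasonal base signal (the actual equations are not reproduced here)
y_clean = 10.0 * np.sin(2 * np.pi * t / 12)

# normally distributed noise with a mean value of 1 (scale assumed to be 1)
y = np.round(y_clean + rng.normal(loc=1.0, scale=1.0, size=t.size), 2)

y_train = y[:2000]   # first 2000 values used for training
y_test = y[2000:]    # last 100 values used for testing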
The first 2000 values of the time series are used as training data y_train. The last 100 values of the time series are used as test data y_test.
The training data y_train, the first 500 values of the training data and the last 500 values of the training data are displayed in
Having specified the training time series data, the steps for training the probabilistic models used in the application example will be explained next.
In this example a simple linear regressive model
will be used to predict the time series y. The predicted values of the time series are denoted by ŷ, in contrast to the time series itself, y. X denotes an n×p feature matrix, where each row xi is one training example, and y denotes the time series (also known as the target vector) of length n, where each entry yi corresponds to xi. β is a vector of parameters or weighting factors. Note that the parameter vector β and the so-called intercept β0 can be combined into one vector β by adding a constant column of ones to the feature matrix.
The goal of training the probabilistic models by machine learning is to find the optimal parameters β*, which minimize the overall empirical loss function
where Lτ is the so-called quantile loss or pinball loss for the τ-th quantile:
In this application example, the cost function
will be minimized, where ∥β∥2² = Σjβj² and ∥β∥1 = Σj|βj| denote the well-known L2- and L1-regularization terms.
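For reference, a short Python sketch of the pinball loss and of such a regularized cost is given below. The function quantile_loss(residual, tau) is used with exactly this signature later in the example; the cost function shown here (mean pinball loss plus L2 and L1 penalties on all coefficients) is only an assumed reading of Equ. 6 and 7.

import numpy as np

def quantile_loss(residual, tau):
    # pinball loss: tau*r for r >= 0 and (tau - 1)*r for r < 0
    residual = np.asarray(residual)
    return np.maximum(tau * residual, (tau - 1.0) * residual)

def regularized_cost(beta, X, y, tau, lambda2=0.0, lambda1=0.0):
    # mean pinball loss plus L2 and L1 regularization terms (assumed form)
    residual = y - X @ beta
    return (np.mean(quantile_loss(residual, tau))
            + lambda2 * np.sum(beta ** 2)
            + lambda1 * np.sum(np.abs(beta)))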
In the 1st step of model training by machine learning, the feature matrix X is created considering solely autoregressive features. The only other feature is the intercept, which is a constant column on the very right-hand side.
In this example, 20 autoregressive features, i.e. n=20 as defined in Equ. 5, and one intercept value will be used. The name of the feature matrix X in the Python code is y_matrix_df.
Note that X=y_matrix_df has a dimension of 1980 rows (2000 rows of training data−20 autoregressive features) and 21 columns (20 autoregressive features plus the intercept). The first five and the last five rows of the feature matrix are shown in
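A possible construction of such a feature matrix in Python, using pandas and assumed column names, could look as follows; the ordering of the lag columns (most recent value first, intercept on the far right) follows the input vector X_start described further below.

import numpy as np
import pandas as pd

n_lags = 20
y_series = pd.Series(y_train)

# lag-1 ... lag-20 columns (most recent value first), intercept on the far right
y_matrix_df = pd.DataFrame({f"lag_{k}": y_series.shift(k) for k in range(1, n_lags + 1)})
y_matrix_df["intercept"] = 1.0
y_matrix_df = y_matrix_df.iloc[n_lags:].reset_index(drop=True)  # drop the rows containing NaNs

target = y_series.iloc[n_lags:].to_numpy()   # targets y_i corresponding to the rows of X
X = y_matrix_df.to_numpy()                   # 1980 x 21, as stated above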
After having defined the feature matrix X, the optimal values for the autoregressive features β are calculated by a modified Iterative Least Squares (ILS) algorithm. Note that the method presented below does not follow the classical ILS algorithm, since the cost function of Equ. 7 enables L2 and approximate L1 regularization. Because the classical ILS algorithm does not allow regularization, it is unsuitable for many real-world cases, since outliers have a strong effect on the parameter vector β. Here, a modified ILS algorithm is used, which allows both L2 and L1 regularization. This algorithm is called RSILS (Regularized Smoothed Iterative Least Squares).
It is advantageous to use the RSILS algorithm for Quantile Regression for cases with ≥1000 datapoints. The algorithm provides an efficient and fast approximate solution of the regularized optimization problem.
As the quantile loss or pinball loss function Jτ(β, β0) = Σi=1…n Lτ(yi, xi·β) given in Equ. (5) is a non-smooth function in the region around 0 (see e.g. τ=0.1 in
The smoothing is done by introducing the function
such that an optimized β can be determined in an iterative manner by Equ. 9.
where X is the feature matrix, c is the area of smoothing, δ is the area of smoothing for the L1 regularization, and I1, I2, I3 are matrices depending on the current coefficients.
The matrices I1, I2 and I3 are defined as
and W1, W2 are arrays depending on the current residuals
After defining all this, the application of the RSILS algorithm is quite straightforward, involving the following steps:
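The exact RSILS update of Equ. 8 to 13, with the matrices I1, I2, I3 and the arrays W1, W2, is not reproduced in this text. As a schematic stand-in, the sketch below shows a generic iteratively reweighted least squares loop for the L2-regularized pinball loss; it mirrors the structure of the steps (initial ridge solution, recompute weights from the residuals, solve a weighted least squares system, check convergence) but not the specific RSILS formulas, and it omits the approximate L1 regularization.

import numpy as np

def irls_quantile_regression(X, y, tau, lambda2=1e-3, tol=1e-4,
                             max_iter=100, eps=1e-6):
    # NOT the exact RSILS update of Equ. 9: the weights
    # w_i = |tau - 1{r_i < 0}| / max(|r_i|, eps) re-express the pinball loss
    # as a weighted squared loss, which is then minimized iteratively.
    n, p = X.shape
    beta_new = np.linalg.solve(X.T @ X + lambda2 * np.eye(p), X.T @ y)  # initial ridge solution
    for _ in range(max_iter):
        beta_old = beta_new
        r = y - X @ beta_old
        w = np.abs(tau - (r < 0)) / np.maximum(np.abs(r), eps)
        XtW = X.T * w                                                   # scale each data point by its weight
        beta_new = np.linalg.solve(XtW @ X + lambda2 * np.eye(p), XtW @ y)
        if np.max(np.abs(beta_new - beta_old)) < tol:                   # convergence check
            break
    return beta_new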
For the sake of simplicity, the optimization of β will be demonstrated for one quantile τ=0.25 only. We choose c=4, λ2=10⁻³ and λ1=0. The tolerance is set to 1×10⁻⁴. The corresponding code in Python is:
The initial solution (given by the classical solution of the simple “Ridge Regression”) for βnew rounded to three digits is:
The code for calculating the residuals is
residual = target - np.dot(x, beta_new)
quantile_loss(residual, tau)
and the mean quantile loss is 0.444.
In the next step, W1, W2, I1, I2 and I3 are calculated
In step 5, βold is set to βnew.
Step 6 checks whether β is converging. In other words, if AbsMax(βold−βnew)<diff then βnew will be returned, otherwise new values for βnew will be calculated.
As β has not yet converged, βnew is calculated using the formulas given above and utilizing the following code:
This results in βnew=[0.197, 0.127, −0.006, −0.098, −0.116, −0.131, −0.105, −0.075, 0.025, −0.014, 0.041, −0.011, −0.06, −0.05, −0.012, −0.017, 0.091, 0.14, 0.193, 0.197, −0.772]
In the next step, the quantile loss is checked again:
The mean value of the quantile loss is 0.354.
Next it is checked whether the iteration has already converged
np.max(np.abs(beta_new - beta_old))
As the result 0.778 is still greater than the tolerance, further iterations are performed until the algorithm has converged. This is done using the following code:
After 4 iterations, the algorithm converges and β for τ=0.25 is:
Repeating the above steps for τ=0.5 and τ=0.75 yields:
By performing the above steps, three probabilistic models, namely a 1st model for predicting lower quantile values with τ=0.25, a 2nd model for predicting median values with τ=0.5, and a 3rd model for predicting higher quantile values with τ=0.75 are trained.
Generally, the next forecast value is given by ŷ=X·β (see Equ. 4). As we have three different parameter vectors β25, β50, β75 corresponding to τ=0.25, τ=0.5 and τ=0.75, the forecast value ŷq25 of the time series for the 25% quantile is given by ŷq25=X·β25. Likewise ŷq50=X·β50 and ŷq75=X·β75. Put differently, we have one model for forecasting the 25% quantile, one for the 50% quantile and another one for the 75% quantile of the time series.
As noted above, these models may be used to calculate forecast values for the 25% quantile, the 50% quantile and the 75% quantile of the predicted time series ŷq25, ŷq50 and ŷq75, respectively.
In Python this is done by the following code:
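Since the listing itself is not reproduced here, a hedged sketch is given below. It assumes that the trained parameter vectors from above are available as beta_q25, beta_q50 and beta_q75 (called β25, β50 and β75 in the text) and that y_train is a numpy array.

import numpy as np

# last 20 observed values, most recent first, plus a trailing 1 for the intercept
X_start = np.append(y_train[-1:-21:-1], 1.0)

y_q25 = float(np.dot(X_start, beta_q25))   # lower quantile forecast
y_q50 = float(np.dot(X_start, beta_q50))   # median forecast
y_q75 = float(np.dot(X_start, beta_q75))   # higher quantile forecast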
The predicted values for ŷq25, ŷq50 and ŷq75 at t=2000 (i.e. the 2001st value of the time series, since Python arrays start with index 0) are (1.067, 2.058, 3.022). In other words, the time series for calculating a lower quantile forecast, a median forecast, and a higher quantile forecast at t=2000 is defined by taking the last 20 values [0.43, −4.09, −3.67, −3.91, −1.24, −1.34, −2.13, −4.24, −4.96, −5.50, −5.72, −3.59, 1.25, 3.54, 8.11, 11.23, 8.11, 8.75, 1.16, −1.87] of the time series y, i.e. (y1980, . . . , y1999), in inverse order; adding a 1 on the right-hand side of the vector for the intercept yields the input features X_start. The time series vector is also given in the last row of
If the anomaly detection shall be based on the 25% and the 75% quantiles of the forecasted time series, the lower threshold is 1.067 and the upper threshold is 3.022. If the received value yn at time step 2000 is greater than the upper threshold, then this value is marked as a potentially abnormal data point, e.g. an outlier. Likewise, if the received value yn at time step 2000 is smaller than the lower threshold, then this value is marked as well.
In case the low and high thresholds, say the 10% and 90% quantiles, differ from the quantiles used in training the probabilistic models, in a first step the minimum and maximum values are determined using the lower quantile forecast 1.067, the median forecast 2.058 and the higher quantile forecast 3.022 for the 25%, 50% and 75% quantiles, respectively. As shown in
When forecasting time series data, the steps of receiving training time series data, training the probabilistic models and receiving time series data are identical to the method for anomaly detection. In contrast to anomaly detection, however, not just one time step into the future is predicted but multiple time steps.
The accuracy of prediction can be increased further by performing multiple statistical experiments on the forecast values. Although there is no authoritative definition for the term in the art, performing multiple statistical experiments in order to solve complex problems, is sometimes called random sampling, or “Monte Carlo Sampling”. In addition, statistical experiments allow not just the forecasting of certain pre-defined quantiles, i.e. the same quantiles used for training the model, but arbitrary quantiles. This is done by adding the function sample_value to the class ProbabilisticRegressionModel defined above:
The code for predicting the 5%, the 50% and the 95% quantiles up to a time horizon of 100 forecasts of the time series y is as follows:
The above code predicts 100 paths (num_paths=100) up to a prediction horizon of 100 (fh=100).
As above, first the forecast values for the 25%, 50% and 75% quantiles are calculated. The function sample_value maps these forecast values to a forecast range and calculates an evenly distributed random value between 0 and 1. The forecast range assumes that the difference between the 50% quantile and the 25% quantile is also the difference between the 25% quantile and the 0% quantile. Likewise, it is assumed that the difference between the 75% quantile and the 50% quantile is also the difference between the 100% quantile and the 75% quantile. The mapping to the segments is as follows:
Interpolating the random value to the forecast range yields one result of the statistical experiment. In other words, a random value of 0.5 is interpolated to be in the “middle” between yt+1_q0 and yt+1_q100, a random value of 0 is located at the lower bound and a random value of 1 is at the upper bound. Each random value between 0 and 1 mapped into the range between a low threshold and a high threshold represents one statistical experiment.
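One possible reading of this mapping in Python is sketched below: the random number is interpolated piecewise linearly through the five quantile points. The name sample_value is taken from the text, but the standalone signature and the exact segment handling are assumptions.

import numpy as np

def sample_value(q25, q50, q75, rng=None):
    # Map one uniform random number into the forecast range [q0, q100].
    rng = rng or np.random.default_rng()
    q0 = q25 - (q50 - q25)      # assumed virtual 0% quantile
    q100 = q75 + (q75 - q50)    # assumed virtual 100% quantile
    u = rng.uniform(0.0, 1.0)
    # piecewise linear interpolation through the known quantile points
    return float(np.interp(u, [0.0, 0.25, 0.5, 0.75, 1.0],
                           [q0, q25, q50, q75, q100]))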
For calculating the forecast for the next time step, two possibilities exist:
According to a first variant of the method shown in
The second variant shown in
For each time step, 100 statistical experiments are conducted and the 5% and the 95% quantile values are determined from the results of the statistical experiments. These quantile values define the upper and the lower bound of the predictions, respectively.
In a second application example it will be demonstrated how a modified algorithm for solving the optimization problem of Equ. 7 can be used to predict a time series of application monitoring and observability data.
The historical time series is identical to the time series used in the first application example. However, the length of the training data is 140 and the length of the test data is 70. One time step can be e.g. one second, one minute, one hour, one day, or multiples of it.
As in the first example, normally distributed noise is added to y, where the noise has a mean value of 1. To increase the readability of the time series, the y values are again rounded to 2 decimals.
The code for defining the training and test data in Python is:
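As before, the original listing is not reproduced here. The hedged sketch below reuses the placeholder signal from the first example; only the lengths of the training and test data are taken from the description.

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(210)
# placeholder seasonal signal as in the first example (actual equations not reproduced)
y = np.round(10.0 * np.sin(2 * np.pi * t / 12) + rng.normal(loc=1.0, scale=1.0, size=t.size), 2)

y_train = y[:140]   # 140 training values
y_test = y[140:]    # 70 test values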
The first 10 values of y are [2.01, 6.58, 10.7, 10.0, 8.91, 5.17, 0.0124, −2.22, −5.16, −5.67]. The training data y_train is shown in
Using the same Python code as above yields a feature matrix X with 120 rows (140 training data points−20 lags considered) and 21 columns (20 lags+1 intercept). The first 5 rows of X are shown in
The second application example uses a Coordinate Descent (CD) algorithm for Quantile Regression, which is particularly advantageous for cases with fewer datapoints (e.g. <1000). The coordinate descent algorithm provides an exact solution to the regularized quantile regression problem.
The general idea of the coordinate descent algorithm is to optimize a given cost function like Cτ(β) (see Equ. 7) of a p-dimensional argument β=(β1, β2, . . . , βp) by optimizing for one parameter βj at coordinate j (j typically chosen randomly) while keeping all the other parameters βl, l≠j, fixed. Hence, the optimization problem is reduced to a series of univariate optimization problems and has in general a runtime complexity of O(m·p·U(n)), where U(n) is the runtime complexity of the univariate optimization problem and m is the number of passes through the p variables, which is typically proportional to p itself.
This boils down to finding the minimum of Cτ(β1, . . . , βj, . . . , βp) with respect to βj at coordinate j=1, . . . , p. Hence Cτ(β1, . . . , βj, . . . , βp) can be written as a function of β (dropping the subscript j from βj and the superscript τ from J and C for brevity) as
where B−j=Σl≠jxilβl and R−j are constants not depending on β when minimizing C(β). The additive constant R−j, which stems from the regularization terms and does not depend on β, can and will be assumed to be 0. Also the point loss function Lτ becomes a function of β and can be rewritten (omitting the superscript τ for brevity and adding the subscript i to denote the dependency on the data point i) as Li(β)=Lτ(rij,xijβ) with the residuals rij=yi−ŷ+xijβ. So we get for the univariate loss function to be minimized
It can be seen that Li(β)≥0 for all values of β and Li(ki)=0 for the knot
From this observation it is easy to see that C(β) is the sum of n+2 non-negative and convex functions: n piece-wise linear terms Li(β), a quadratic term λ2β², and the piece-wise linear term λ1|β| with its knot k0=0. Hence, C(β) is also a non-negative and convex function whose derivative is discontinuous at the knots ki, i=0, . . . , n. Furthermore, the function
is linear between two consecutive knots ki
However, by sorting the knots ki such that k0<k1< . . . <kn−1<kn and applying some algebra one can derive a constant time update equation C1(ki+1)=C1(ki)+(ki+1−ki)·Si+1 where Si+1 is the slope of C1(⋅) between ki and ki+1. In the following set of update equations x(i)j and r(i)j denote the values from which the knot
has been computed.
The initial value of the cost function at knot k0 is
The left slope of C1(ki) is
if ki originates from the L1-regularization term λ1|β|.
The right slope of C1(ki) is
if ki originates from the L1-regularization term λ1|β|.
Slope left of k0: S0 = Σi=0…n αi
The computation of the right slope between ki and ki+1 yields
Finally, the update for C1 is:
With the help of these equations the univariate minimizer β* can be computed efficiently in O(n log n) time, yielding an overall time complexity of CD of O(p²·n log n), which is now well suited to our needs.
The minimizer β* of C(β)=C1(β)+λ2β² in the case of λ2>0 need not be exactly at one of the knots ki but has to be in a closed interval [kl, kl+1] for which the right slope of C(⋅) at kl is negative and the left slope of C(⋅) at kl+1 is positive (this is the definition of how to bracket a minimum of a function). Hence, for λ2>0 one just has to search for l such that the above conditions are fulfilled, check if either C(kl) or C(kl+1) is the minimum and, if not, solve the equation
(for which an analytic solution exists) in the interval [kl, kl+1]. Hence, also in the case λ2>0 this approach yields an exact algorithm to solve the regularized univariate quantile regression problem with a runtime complexity of O(n log n).
The mathematics presented above will be applied to our 2nd application example. The calculation is done for τ=0.25 only, λ2=0.001 and λ1=0.1. We use a convergence tolerance of 0.001.
The coordinate descent example starts with an initial solution of β=(0, 0, . . . , 0) and then applies the above-described steps until convergence has been reached. The application example can be reproduced by the following three Python functions:
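The three original functions are not reproduced in this text. The sketch below is a simplified stand-in: the outer coordinate descent loop follows the description above, but the univariate subproblem is solved by brute-force evaluation of the convex cost at all knots (plus the L1 knot at 0) instead of the O(n log n) incremental update, which is slower and, for λ2>0, only approximate because the true minimizer may lie between knots.

import numpy as np

def univariate_qr(target, x_col, tau, lambda2, lambda1):
    # Approximate univariate minimizer: evaluate the convex cost at all knots
    # target_i / x_i (and at 0, the knot of the L1 term) and keep the best one.
    nz = x_col != 0
    knots = np.append(target[nz] / x_col[nz], 0.0)
    best_beta, best_cost = 0.0, np.inf
    for b in knots:
        r = target - x_col * b
        c = (np.mean(np.maximum(tau * r, (tau - 1.0) * r))
             + lambda2 * b * b + lambda1 * abs(b))
        if c < best_cost:
            best_beta, best_cost = b, c
    return best_beta

def coordinate_descent_qr(X, y, tau, lambda2=1e-3, lambda1=0.1,
                          tol=1e-3, max_iter=100):
    # Outer loop: cycle over the coordinates, re-optimizing one beta_j at a time
    # against the partial residuals, until the largest change (delta) is < tol.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        delta = 0.0
        for j in range(p):
            # partial residuals: remove the contribution of all other coordinates
            target = y - X @ beta + X[:, j] * beta[j]
            new_bj = univariate_qr(target, X[:, j], tau, lambda2, lambda1)
            delta = max(delta, abs(new_bj - beta[j]))
            beta[j] = new_bj
        if delta < tol:
            break
    return beta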
During the first iteration the target for j=0 equals the time series y itself as all values of β are 0.0. Hence, the univariate quantile regression gets called with the following arguments:
This leads to the following set of sorted knots ki, left slopes αi, right slopes bi and costs C1(ki):
Then for j=1 the situation has already changed, as the target is no longer equal to y because β0=0.883. In particular, for iteration j=1 the univariate quantile regression gets called with the following arguments:
which in turn leads to the following sorted knots ki, left slopes αi, right slopes bi and costs C1(ki):
After the univariate quantile regression has been executed for each j=0, . . . , 20 in iteration 1, β has assumed the following value:
For this value of β we observe delta=0.883, a quantile loss of 0.506, costs of C(β)=0.769, an L1-norm of β of Σj=0…20|βj|=2.62, and an L2-norm of 1.13.
After another 21 executions of the univariate quantile regression at the end of iteration 2 we get:
For this value of β we observe delta=0.107, a quantile loss of 0.426, costs of C(β)=0.689, an L1-norm of 2.63, and an L2-norm of 0.964. Note how the loss, the costs and delta have been reduced during the second iteration. The following table shows how these quantities evolve as iteration after iteration is executed until delta<0.001:
After these 33 iterations the final solution for β by CD is:
Note that with the CD algorithm a quantile loss of 0.323 has been achieved, which is similar to that of the RSILS algorithm. However, the main difference is that we are able to correctly compute an L1- and L2-regularized solution. This can be seen from the fact that 6 out of the 21 regression parameters are exactly zero (in contrast to RSILS, where all 21 parameters are non-zero).
Using the three functions for the CD algorithm above, the optimized parameter vector βq25 for the 25% quantile τ=0.25 can be calculated by the following Python code:
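Using the stand-in functions sketched above (not the original three functions of the disclosure), such a call could look as follows, with X and target built as in the feature-matrix sketch:

beta_q25 = coordinate_descent_qr(X, target, tau=0.25, lambda2=0.001, lambda1=0.1, tol=0.001)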
In an analogous manner, the optimized parameter vectors βq50 and βq75 for the 50% and 75% quantiles can be calculated.
The resulting optimized parameter vectors
The parameter vectors β for the 25%, 50% and 75% quantile values of the forecast time series—as in the case of the RSILS algorithm—can be used both for anomaly detection and to forecast multiple future values of the time series ŷ.
Although the code for reworking the above application examples was run on Python 3.11, the claimed methods or algorithms are neither limited to Python, nor to the number or level of quantiles used in the application examples. Furthermore, the number of paths and the prediction horizon can be adapted freely. Finally, the above examples consider only autoregressive features. Instead of or in addition to autoregressive features, date and time features as well as seasonal features can be considered in the feature matrix. As the addition of such features is well known in the art, methods involving these features shall be covered by the disclosure.
The main steps performed in anomaly detection are shown in
The steps performed in anomaly detection are also shown in
Block 350 predicts the low and high quantile forecasts by considering yn−1 . . . yn−20 (n=20 time lags assumed) and the parameter vectors βlow and βhigh. The upper and lower threshold values are calculated by block 360 in functional relation to the low and high quantile forecasts. The blocks 370 and 380 compare yn to the upper (block 370) and lower (block 380) threshold. If yn>yhighT or yn<ylowT, then yn is marked as a potentially abnormal value.
The main steps performed in forecasting a time series of application monitoring data, observability data and data observability data are shown in
The techniques described herein may be implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on a non-transitory tangible computer readable medium. The computer programs may also include stored data. Non-limiting examples of the non-transitory tangible computer readable medium are nonvolatile memory, magnetic storage, and optical storage.
Some portions of the above description present the techniques described herein in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or by functional names, without loss of generality.
Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Certain aspects of the described techniques include process steps and instructions described herein in the form of an algorithm. It should be noted that the described process steps and instructions could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a computer selectively activated or reconfigured by a computer program stored on a computer readable medium that can be accessed by the computer. Such a computer program may be stored in a tangible computer readable storage medium, such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
The algorithms and operations presented herein are not inherently related to any particular computer or other apparatus. Various systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatuses to perform the required method steps. The required structure for a variety of these systems will be apparent to those of skill in the art, along with equivalent variations. In addition, the present disclosure is not described with reference to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure as described herein.
The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.
This application claims the benefit of U.S. Provisional Application No. 63/445,075, filed on Feb. 13, 2023. The entire disclosure of the above application is incorporated herein by reference.