System and Method for Anomaly Detection

Information

  • Patent Application
  • Publication Number
    20150333998
  • Date Filed
    May 15, 2014
  • Date Published
    November 19, 2015
Abstract
In one embodiment, a method of determining whether a metric is an anomaly includes receiving a data point and determining a metric in accordance with the data point and a center value. The method also includes determining whether the metric is below a lower threshold, between the lower threshold and an upper threshold, or above the upper threshold and determining that the data point is not the anomaly when the metric is below the lower threshold. Additionally, the method includes determining that the data point is the anomaly when the metric is above the upper threshold and determining that the data point might be the anomaly when the metric is between the lower threshold and the upper threshold.
Description
TECHNICAL FIELD

The present invention relates to a system and method for wireless communications, and, in particular, to a system and method for anomaly detection.


BACKGROUND

In network elements of a radio access network, such as base stations (or NodeBs or eNodeBs or cells) or radio network controllers (RNCs) of a cellular system, anomalies occur every now and then. Examples of anomalies include cell outage (e.g., sleeping cell), which may be indicated by key performance indicators (KPIs) with unusually poor (low or high) values. Anomalies may also occur in the form of unusual or broken relationships or correlations observed between sets of variables. It is desirable for anomalies to be rapidly detected while minimizing false alarms.


An anomaly has a root cause, such as a malfunctioning user equipment (UE) or network element, interference, or resource congestion from heavy traffic. In particular, the bottleneck may be the downlink power, uplink received total wideband power, downlink bandwidth (codes or resource blocks), uplink bandwidth (resource blocks), backhaul bandwidth, channel elements (CE), control channel resources, etc. It is desirable to determine the root cause of an anomaly.


SUMMARY

An embodiment method of determining whether a metric is an anomaly includes receiving a data point and determining a metric in accordance with the data point and a center value. The method also includes determining whether the metric is below a lower threshold, between the lower threshold and an upper threshold, or above the upper threshold and determining that the data point is not the anomaly when the metric is below the lower threshold. Additionally, the method includes determining that the data point is the anomaly when the metric is above the upper threshold and determining that the data point might be the anomaly when the metric is between the lower threshold and the upper threshold.


An embodiment method of root cause analysis includes traversing a soft decision tree, where the soft decision tree includes a plurality of decision nodes and a plurality of root cause nodes. Traversing the soft decision tree includes determining a first plurality of probabilities that the plurality of decision nodes indicate an event which is an anomaly and determining a second plurality of probabilities of the plurality of root causes in accordance with the first plurality of probabilities.


An embodiment computer for detecting an anomaly includes a processor and a computer readable storage medium storing programming for execution by the processor. The programming includes instructions to receive a data point and determine a metric in accordance with the data point and a center value. The programming also includes instructions to determine whether the metric is less than a lower threshold, between the lower threshold and an upper threshold, or greater than the upper threshold and determine that the data point is not the anomaly when the metric is less than the lower threshold. Additionally, the programming includes instructions to determine that the data point is the anomaly when the metric is greater than the upper threshold and determine that the data point might be the anomaly when the metric is between the lower threshold and the upper threshold.


The foregoing has outlined rather broadly the features of an embodiment of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of embodiments of the invention will be described hereinafter, which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiments disclosed may be readily utilized as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.





BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:



FIG. 1 illustrates an embodiment wireless network for communicating data;



FIG. 2 illustrates a flowchart for an embodiment method of anomaly detection;



FIGS. 3A-B illustrate example probability density functions;



FIG. 4 illustrates a probability density function with example data points;



FIG. 5 illustrates a flowchart for another embodiment method of anomaly detection;



FIGS. 6A-B illustrate an example probability density function;



FIG. 7 illustrates example data;



FIG. 8 illustrates an example histogram with inner, middle, and outer bands;



FIG. 9 illustrates example inner, middle, and outer bands;



FIG. 10 illustrates a graph of example data samples over time;



FIG. 11 illustrates a flowchart of an additional embodiment method of anomaly detection;



FIG. 12 illustrates an example of a hard decision tree;



FIG. 13 illustrates an example of a soft decision tree;



FIG. 14 illustrates an example probability function;



FIG. 15 illustrates a flowchart for an embodiment method of root cause analysis; and



FIG. 16 illustrates a block diagram of an embodiment general-purpose computer system.





Corresponding numerals and symbols in the different figures generally refer to corresponding parts unless otherwise indicated. The figures are drawn to clearly illustrate the relevant aspects of the embodiments and are not necessarily drawn to scale.


DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

It should be understood at the outset that although an illustrative implementation of one or more embodiments is provided below, the disclosed systems and/or methods may be implemented using any number of techniques, whether currently known or in existence. The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.


An embodiment method detects anomalies and the root causes of the anomalies. Examples of root causes of anomalies include malfunctioning user equipment (UE) or network elements, interference, and resource congestion from heavy traffic. In particular, the bottleneck may be the downlink power, uplink received total wideband power, downlink bandwidth (codes or resource blocks), uplink bandwidth (resource blocks), backhaul bandwidth, channel elements (CE), control channel resources, etc. It is desirable to detect and determine the root cause of an anomaly.


Some anomaly detection methods select thresholds on variables or distance metrics, yielding a decision boundary based on the structure of training data, to determine that outliers represent anomalies. However, selection of the threshold often trades off false alarms against missed anomalies and detection time. An embodiment method uses two threshold levels to detect anomalies. When the data point is below the lower threshold, it is determined to not indicate an anomaly. When the data point is above the upper threshold, it is determined to indicate an anomaly. When the data point is between the thresholds, the history is used to determine whether the data point indicates an anomaly.


The root cause of a detected anomaly is determined in an embodiment method. A hard decision tree may be used. The decision tree may be created by an expert or learned from the data. However, hard decision trees may lead to an unknown diagnosis of the root cause or to a misdiagnosis. An embodiment method uses a soft decision tree to determine one or more likely root causes of an anomaly by mapping metrics into probabilities via the logistic function. Then, the probabilities are multiplied together, invoking the naive Bayes assumption of independent attributes. As a result, the most likely root cause or top few likely root causes are determined, along with a probability or confidence measure for each cause.



FIG. 1 illustrates wireless network 100 for wireless communications. Network 100 includes radio network controllers (RNCs) 108 which communicate with each other. RNCs 108 are coupled to communications controllers 102. A plurality of user equipments (UEs) 104 are coupled to communications controllers 102. Communications controllers 102 may be any components capable of providing wireless access by, inter alia, establishing uplink and/or downlink connections with UEs 104, such as base stations, enhanced base stations (eNBs), access points, picocells, femtocells, and other wirelessly enabled devices. UEs 104 may be any component capable of establishing a wireless connection with communications controllers 102, such as cell phones, smart phones, tablets, sensors, etc. In some embodiments, the network 100 may include various other wireless devices, such as relays, femtocells, etc. Embodiments may detect anomalies on a network, such as network 100. Also, the network may determine the root cause of a detected anomaly.



FIG. 2 illustrates flowchart 110 for a method of detecting anomalies. Training data is stored in block 112. The training data is historical data. In block 114, individual features are modeled. For example, a key performance indicator (KPI) is examined for each feature.


Probability density functions may be used to determine whether a data point is likely to be an anomaly. Data points close to the center are likely to not indicate anomalies, while data points in the tails are likely to indicate anomalies. The center may be the mean, median, or another value indicating the center of the expected values. FIGS. 3A-B illustrate probability density functions 120 and 130, respectively. Probability density function 120 has a narrow peak and a low variance, while probability density function 130 has a wider peak and a larger variance. The mean of a probability density function is given by:








$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)},$$




and the variance is given by:







$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2.$$







In block 116, a detection algorithm detects anomalies based on new observations 118 using the modeled data. FIG. 4 illustrates probability density function 142 with data points 144 in the normal range of probability density function 142, and data points 146 in the tail of probability density function 142, which are likely anomalies. The probability to determine that new observations are not anomalies is given by:







$$p(x) = \prod_{j=1}^{n} p\left(x_j; \mu_j, \sigma_j^2\right) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left( -\frac{\left(x_j - \mu_j\right)^2}{2\sigma_j^2} \right).$$








Abnormal validation data points are used to determine the threshold on the probability of the data being normal. A new input xi is predicted to be anomalous if:

$$p(x_i) < \varepsilon,$$

where ε is a threshold learned, for example, from historical data.
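
The single-feature modeling and thresholding described above can be sketched in a few lines. This is a minimal illustration and not the claimed implementation: the function names `fit_feature_gaussians`, `density` and `is_anomalous`, the sample values, and the choice of ε are all assumptions introduced here.

```python
import math


def fit_feature_gaussians(samples):
    """Return per-feature means and variances (1/m normalisation) of the rows."""
    m = len(samples)
    n = len(samples[0])
    means = [sum(row[j] for row in samples) / m for j in range(n)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in samples) / m for j in range(n)
    ]
    return means, variances


def density(x, means, variances):
    """Product over features of the univariate Gaussian densities."""
    p = 1.0
    for xj, mu, var in zip(x, means, variances):
        p *= math.exp(-((xj - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return p


def is_anomalous(x, means, variances, epsilon):
    """Flag x as anomalous when its modeled density falls below epsilon."""
    return density(x, means, variances) < epsilon


# Hypothetical training rows, two KPIs per row (values are made up).
training = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
means, variances = fit_feature_gaussians(training)
flag = is_anomalous([10.0, 40.0], means, variances, epsilon=1e-3)
```

In practice ε would be tuned on validation data containing known anomalies, as the text describes, rather than fixed by hand.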



FIG. 5 illustrates flowchart 160 for a method of detecting anomalies with multiple variables. Multiple KPI behaviors at the same time are modeled in block 164 using training data from block 162. FIGS. 6A-B illustrate graphs 170 and 180 of a two dimensional probability density function. The variables x1 and x2 are independent variables. The relationships among KPIs are learned. The mean is given by:






$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}.$$







Also, the covariance matrix is given by:






$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} \left( x^{(i)} - \mu \right) \left( x^{(i)} - \mu \right)^{T}.$$








A detection algorithm in block 166 detects anomalies from new observations from block 168 using modeled data. FIG. 7 illustrates a graph of existing data points 192 and new data points xA, xB, xC, and xD. These new data points are on the outskirts of the training data and are likely anomalies, especially xA. In this example, one variable is central processing unit (CPU) load and the other variable is memory use. The Mahalanobis distance and probability are calculated. The probability to determine that the new observation is not an anomaly is given by:







$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right).$$






The covariance matrix and the shape of the angle are used. Abnormal validation datasets are used to determine ε, the threshold. A new input xi is predicted to be anomalous when:

$$p(x_i) < \varepsilon.$$
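
As a rough sketch of the multivariate case, the following restricts itself to two variables so that the covariance inverse can be written in closed form without a linear algebra library. The function names and the sample rows (imagined as CPU load and memory use fractions) are hypothetical and serve only to illustrate the Mahalanobis distance and the joint density.

```python
import math


def fit_bivariate(samples):
    """Mean vector and 2x2 covariance matrix (1/m normalisation) of 2-D rows."""
    m = len(samples)
    mu = [sum(r[0] for r in samples) / m, sum(r[1] for r in samples) / m]
    sxx = sum((r[0] - mu[0]) ** 2 for r in samples) / m
    syy = sum((r[1] - mu[1]) ** 2 for r in samples) / m
    sxy = sum((r[0] - mu[0]) * (r[1] - mu[1]) for r in samples) / m
    return mu, [[sxx, sxy], [sxy, syy]]


def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu), 2-D only."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Closed-form inverse of a 2x2 matrix applied to the offset (dx, dy).
    return (cov[1][1] * dx * dx - 2.0 * cov[0][1] * dx * dy + cov[0][0] * dy * dy) / det


def density(x, mu, cov):
    """Bivariate Gaussian density; (2*pi)^(n/2) equals 2*pi for n = 2."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return math.exp(-0.5 * mahalanobis_sq(x, mu, cov)) / (2.0 * math.pi * math.sqrt(det))


# Hypothetical (CPU load, memory use) observations; values are made up.
rows = [[0.2, 0.3], [0.4, 0.5], [0.6, 0.4], [0.8, 0.8]]
mu, cov = fit_bivariate(rows)
```

Because the two variables are positively correlated in this sample, a point such as (0.9, 0.1) lies much farther from the mean in Mahalanobis terms than (0.9, 0.9), even though both are equally far in Euclidean terms.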


An embodiment uses an inner band, a middle band, and an outer band of a probability density function to detect an anomaly. FIG. 8 illustrates graph 200 with inner band 206, middle band 208, and outer band 210 for probability density histogram 202, a one dimensional probability density. Curve 204 illustrates an example of a single threshold which may be used to detect anomalies. Data points corresponding to a frequency above the threshold are determined to not be anomalies, and values below the threshold are determined to be anomalies. Data points in the inner band are determined to not be anomalies, while data points in the outer band are determined to be anomalies. Data points in the middle band may or may not be determined to be anomalies. There is a lower threshold between the inner and middle bands and an upper threshold between the middle and outer bands.



FIG. 9 illustrates graph 220 with inner band 226, middle or donut band 224, and outer band 222 for a bivariate Gaussian example in a two dimensional metric space. Three dimensional or n dimensional examples may be used. As in the one dimensional case, there is a lower threshold between the inner band and the middle band, and an upper threshold between the middle band and the outer band. The lower threshold and the upper threshold are elliptical contours of equal distance. The Mahalanobis distance may be used as a metric for the multi-dimensional case. Prior weighted average Mahalanobis distances may be used when the data is structured as a mixture of Gaussian clusters. The parameters of each cluster mode are learned.


The lower and upper thresholds are derived from historical or training data for the KPI sets. Highly fluctuating KPIs, for example high variance or heavy tail KPIs, such as packet switching (PS) throughput, have a wider distance between the lower and upper threshold. The lower and upper thresholds are closer together for more stable KPIs, such as security mode command failures. In one example, a user selects the sensitivity. When the user increases the sensitivity, more anomalies will be detected at the expense of more false positives. With a higher sensitivity, the donut region shrinks. A user may also select a sacrosanct KPI expectation. When a KPI passes above this absolute threshold, regardless of the degree of deviation from normal, an alarm is raised. Thus, the user selects the upper threshold.


When a metric, for example the Mahalanobis distance of a vector of metrics, passes beyond the upper threshold to the outer band, an alarm is raised. When the metric passes within the lower threshold to the inner band, no alarm is raised, and the alarm is turned off if it was previously on. Also, the delay window timer is reset. When the observed metric enters the middle band or donut region between the lower and upper thresholds, the delay window timer is set. If the observed metric is still in the middle band when the delay window timer expires, an alarm is raised. In one example, the delay window is a fixed value. Alternatively, the delay window depends on the trend of the observed metric. If the value continues to get worse, the alarm is raised earlier.



FIG. 10 illustrates graph 230 of metric 232 over time. Initially, metric 232 is in the inner band, and there is a low probability of an anomaly. Metric 232 enters the donut region, and delay window timer 234 is set. The metric does not reach the upper threshold. However, because the metric is still in the donut band when the delay window timer expires, an alarm is raised. If the metric returned to the inner band before the delay window timer expires, no alarm is raised, and the delay window timer is reset. If the metric again enters the donut band from the inner band, the delay window timer is set again. Operation is similar for two or more variables.


To minimize false alarms and missed detection of anomalies, a wider range between the lower and upper threshold, yielding a larger middle band, may be used. The observations are more likely to stay between these bounds, serving as safety guards. The alarm may be triggered based on the consistency and trend over the delay window. This takes more time, but produces fewer false alarms. Alarms may be more obvious, for example, at the cell level than the RNC level. At the RNC level the aggregate of cells is more stable.



FIG. 11 illustrates flowchart 240 for a method of detecting anomalies. Initially, in step 242, the upper threshold and the lower threshold are determined. This is done based on historical or training data. Values which are well within the normal region are below the lower threshold, values which are well outside the normal region are above the upper threshold, and intermediate values are between the lower threshold and the upper threshold. In one example, a user hard sets the upper threshold. The span between the lower threshold and the upper threshold is set to trade off the sensitivity and false alarm rate with detection time. A larger distance between the lower threshold and upper threshold increases the sensitivity and decreases the false alarm rate at the expense of detection time. The size of the delay window may also be set in step 242. In one example, these values are initially set before receiving data. In another example, these values are periodically updated based on performance.


Next, in step 244, a data point is received. The data point may be a value in a cellular system, or another system.


Then, in step 246, the system determines whether the data point is in the inner band, the middle band or the outer band. When the data point is in the outer band, the system proceeds to step 246 and an alarm is raised. The alarm may trigger root cause analysis. The system may also return to step 244 to receive the next data point.


When the data point is in the inner band, the system proceeds to step 250. No alarm is raised, and the alarm is reset if it were previously raised. Also, the delay window timer is reset if it were previously set. The system then proceeds to step 244 to receive the next data point.


When the data point is in the middle band, the system proceeds to step 254. In step 254, the system determines whether the delay window timer has previously been set. When the delay window timer has not previously been set, the system is just entering the middle band, and proceeds to step 256.


In step 256, the system sets the delay window timer. Then, it proceeds to step 244 to receive the next data point.


When the delay window timer has been previously set, the system proceeds to step 258, where it determines whether the delay window timer has expired. When the delay window timer has not expired, the system proceeds to step 244 to receive the next data point. When the delay window timer has expired, the system proceeds to step 246 to raise an alarm. The system may also consider other factors in deciding to set the alarm, such as the trend. When the data point is trending closer to the upper threshold, an alarm may be raised earlier.
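
The three-band logic with a delay window timer can be sketched as a small state machine. This is an illustrative sketch under stated assumptions, not the flowchart's actual implementation: the class name `BandDetector`, the use of caller-supplied timestamps, a fixed delay window, and the treatment of values exactly on a threshold as belonging to the middle band are all choices made here. The trend-based early alarm is not modeled.

```python
class BandDetector:
    """Two-threshold detector with a fixed delay window for the middle band."""

    def __init__(self, lower, upper, delay_window):
        if lower > upper:
            raise ValueError("lower threshold must not exceed upper threshold")
        self.lower = lower
        self.upper = upper
        self.delay_window = delay_window
        self.timer_start = None  # time at which the middle band was entered
        self.alarm = False

    def update(self, metric, now):
        """Process one data point observed at time `now`; return the alarm state."""
        if metric > self.upper:
            # Outer band: raise the alarm immediately.
            self.alarm = True
        elif metric < self.lower:
            # Inner band: clear the alarm and reset the delay window timer.
            self.alarm = False
            self.timer_start = None
        elif self.timer_start is None:
            # Just entered the middle band: set the delay window timer.
            self.timer_start = now
        elif now - self.timer_start >= self.delay_window:
            # Still in the middle band when the timer expires: raise the alarm.
            self.alarm = True
        return self.alarm


detector = BandDetector(lower=1.0, upper=3.0, delay_window=3)
history = [detector.update(m, t) for m, t in
           [(0.5, 0), (2.0, 1), (2.0, 2), (2.5, 4), (0.5, 5), (4.0, 6)]]
```

In this made-up sequence the metric never crosses the upper threshold before time 4, yet the alarm is raised there because the metric has stayed in the middle band for the full delay window.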


After an anomaly is detected, it is desirable to determine the root cause of the anomaly. FIG. 12 illustrates decision tree 290 for determining a root cause of an anomaly. Some examples of root causes are that the UE is in a coverage hole, there is a big truck blocking coverage of a UE, or that there is a software bug in a UE operating system or in the cellular network. The decision tree is generated from engineering experience or mined from labeled historic data, and may be modified when new causes of anomalies are detected. Decision tree 290 is a hard decision tree, where a path is chosen at each node. Also, decision tree 290 is hierarchical, with drilling down to lower levels performed by special metric sets and/or network nodes. There are three decision levels of decision tree 290: first RNC level 400, the root level, second RNC level 402, and cell level 404. Causes level 406 contains the leaves. A tree node for event Eij acts on a set of test metrics Sij by computing a learned non-linear function, such as a Mahalanobis distance. The tree node compares the output of the function against a threshold to determine a yes or no hard decision.


Initially, in node 292, the system determines whether anomaly event E1 occurred. General metric set S1 is determined, and is compared to a threshold τ1. For example, when the metric is greater than the threshold τ1, anomaly event E1 occurred, and the system proceeds to node 296, and when the metric is less than or equal to the threshold τ1, anomaly event E1 did not occur, and the system proceeds to node 294.


In node 296, the system determines whether anomaly event E22 occurred. Specific metric S22 is determined. Then, specific metric S22 is compared to a threshold τ22. When the metric is less than the threshold, anomaly event E22 did not occur, and the system proceeds to node 302, where it determines that the anomaly is an unknown problem. This may happen, for example, the first time this anomaly occurs. When the metric is greater than the threshold, anomaly event E22 occurred, and the system proceeds to node 304.


In node 304, the system determines whether anomaly event E33 occurred. Metric S33 is determined, and compared to a threshold τ33. When the metric is less than the threshold, anomaly event E33 has not occurred, and the system looks for other anomaly events to determine the root cause in node 314. On the other hand, when the metric is greater than the threshold, it is determined that anomaly event E33 has occurred. Then, in node 316, the root cause is determined to be RNC and cell problem type Z.


In node 294, the system determines whether anomaly event E21 occurred. Metric S21 is determined, and compared to a threshold τ21. When metric S21 is less than the threshold, it is determined that anomaly event E21 did not occur, and the system proceeds to node 298. On the other hand, when metric S21 is greater than or equal to the threshold, the system proceeds to node 300, determining that anomaly event E21 did occur.


In node 298, the system determines whether anomaly event E31 occurred. Metric S31 is determined, and compared to a threshold τ31. When metric S31 is less than the threshold, it is determined that anomaly event E31 did not occur, and the anomaly is not a problem of type X in node 306. When metric S31 is greater than or equal to the threshold, it is determined that the problem is a cell only problem of type X in node 308.


In node 300, the system determines whether anomaly event E32 occurred. Metric S32 is determined, and compared to a threshold τ32. When metric S32 is less than the threshold, it is determined that anomaly event E32 did not occur, and, in node 310, it is determined to look at other anomaly events for the root cause. When metric S32 is greater than or equal to the threshold, it is determined that the problem is an RNC and cell problem of type Y in node 312.
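
The walk through decision tree 290 can be sketched as a table-driven traversal. The dictionary layout, the function name `diagnose`, and the example metric and threshold values are assumptions made for illustration. The text is not uniform about which branch a metric exactly equal to its threshold takes, so this sketch simply treats "greater than" as yes everywhere.

```python
# Each decision node maps to (child on yes, child on no); strings that are
# not keys of TREE are leaves, i.e. diagnosed causes. Node names follow the
# events in the text; the data layout itself is hypothetical.
TREE = {
    "E1": ("E22", "E21"),
    "E22": ("E33", "unknown problem"),
    "E33": ("RNC and cell problem of type Z", "look for other anomaly events"),
    "E21": ("E32", "E31"),
    "E31": ("cell only problem of type X", "not a problem of type X"),
    "E32": ("RNC and cell problem of type Y", "look at other anomaly events"),
}


def diagnose(metrics, thresholds, node="E1"):
    """Follow yes/no hard decisions from the root until a leaf is reached."""
    while node in TREE:
        yes_child, no_child = TREE[node]
        # Assumption: a metric strictly above its threshold means "yes".
        node = yes_child if metrics[node] > thresholds[node] else no_child
    return node


thresholds = dict.fromkeys(TREE, 1.0)
metrics = dict.fromkeys(TREE, 0.0)
metrics.update({"E21": 2.0, "E32": 2.0})
cause = diagnose(metrics, thresholds)
```

A single noisy metric near its threshold flips the whole path in this scheme, which is the weakness the soft decision tree is meant to address.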


Decision tree 290 is a binary tree, but a non-binary tree may be used. For example, there may be joint analysis of two events A and B with four mutually exclusive leaves: A and B, A and not B, not A and B, and not A and not B. If A and B arise from different components, then the respective probabilities are multiplied. Likewise, there may be eight leaves for 3 potential events, and so on.
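
The joint analysis of several events can be illustrated by enumerating the mutually exclusive leaves. The sketch below assumes the events are independent, as when they arise from different components, so that probabilities multiply; the function name and the example probabilities are hypothetical.

```python
from itertools import product


def joint_leaf_probabilities(event_probs):
    """Map each yes/no outcome tuple to its probability, assuming independence.

    event_probs: dict of event name -> probability that the event occurred.
    """
    names = list(event_probs)
    leaves = {}
    for outcome in product([True, False], repeat=len(names)):
        p = 1.0
        for name, happened in zip(names, outcome):
            p *= event_probs[name] if happened else 1.0 - event_probs[name]
        leaves[outcome] = p
    return leaves


# Two events give four leaves; three events would give eight.
leaves = joint_leaf_probabilities({"A": 0.9, "B": 0.2})
```

Here the key `(True, False)` corresponds to the leaf "A and not B".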



FIG. 13 illustrates soft decision tree 320, which is used to determine the cause of an anomaly. Using a soft decision tree, the probability that a particular problem caused the anomaly may be determined. One or more likely root cause(s) may be determined. The probability that a given node is yes, Pij, is the logistic function applied to the difference between the learned function value and the threshold for anomaly event Eij. The probability is determined from the distance from the mean. The probability is given by:








$$P_{ij} = \frac{1}{1 + e^{-\left( f(x_{ij}) - \tau_{ij} \right)}},$$




where f(xij) is the learned function, for example the Mahalanobis distance, its argument x is the test vector, and τij is the threshold. FIG. 14 illustrates graph 360, an example logistic function. The probability of a no is 1−Pij. This is because of the mutual exclusivity of events and their complements at the nodes in the tree, which implies that a set of leaves or root causes are mutually exclusive.
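
The mapping from a learned function value and a threshold to a node probability can be sketched as follows. The function name `node_probability` is introduced here for illustration, and the two-branch form is simply a numerically stable way of evaluating the same logistic function for large negative arguments.

```python
import math


def node_probability(f_value, tau):
    """Logistic function of (f_value - tau): probability that the node is yes."""
    z = f_value - tau
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow when z is very negative.
    ez = math.exp(z)
    return ez / (1.0 + ez)


p_yes = node_probability(5.0, 3.0)  # metric above its threshold
p_no = 1.0 - p_yes                  # probability of the complementary branch
```

A metric exactly at its threshold gives a probability of 0.5, so the hard yes/no decision becomes a graded one around the threshold.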


Given nodes ij and kl, corresponding to events Eij and Ekl, respectively, the edge weight between the nodes, if it exists on the decision tree, is denoted by (ij, kl). The probabilities are converted to a distance via a transform, so that the edge weight is given by:






$$D_{ij,kl} = -\ln(P_{ij})$$


when (ij, kl) exists on a yes branch, and by:






$$D_{ij,kl} = -\ln(1 - P_{ij})$$


when (ij, kl) exists on a no branch. The edge weight is ∞ otherwise. Dij,kl decreases as Pij increases for a yes edge and as (1−Pij) increases for a no edge. To find the final likelihood of a root cause in a node, Dij,kl is summed along the path to that leaf from the root. The shortest distance path from the root node to one of the leaf nodes is the most likely root cause. Several likely root causes may be considered, along with their likelihood. For example, all root causes with a distance below a threshold may be considered. Alternatively, the two, three, or more of the smallest edge distances are considered. The most likely path or set of events is the argument of the minimum (arg min) path from the root to the leaves, that is:





$$\arg\min_{\text{path}} \sum_{(ij,kl) \in \text{path}} D_{ij,kl}.$$


Under the naïve Bayes assumption, additive distances along the edges of a candidate path lead to multiplicative probabilities for independent events Eij along the path.
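
The link between additive distances and multiplicative probabilities can be written out explicitly. As a sketch, let q denote the branch probability on edge (ij, kl), that is, Pij on a yes branch and 1−Pij on a no branch (the symbol q is notation introduced here for illustration only), with each edge weight taken as the negative log of that probability:

```latex
\begin{aligned}
\sum_{(ij,kl)\in\text{path}} D_{ij,kl}
  &= -\sum_{(ij,kl)\in\text{path}} \ln q_{ij,kl}
   = -\ln \prod_{(ij,kl)\in\text{path}} q_{ij,kl}, \\
\prod_{(ij,kl)\in\text{path}} q_{ij,kl}
  &= \exp\Bigl(-\sum_{(ij,kl)\in\text{path}} D_{ij,kl}\Bigr).
\end{aligned}
```

For instance, with illustrative probabilities of 0.9, 0.8, and 0.7 on three yes edges, the path probability is 0.504 and the path distance is approximately 0.685, so the path with the smallest distance is the path with the largest probability.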


Soft decision tree 320 has root level of first RNC level 346, two intermediate levels, second RNC level 348 and cell level 350, and a root cause level, leaves 352. Initially, in node 322, the system determines the probability P11 that anomaly event E11 occurred. This probability is given by:








$$P_{11} = \frac{1}{1 + e^{-\left( f(x_{11}) - \tau_{11} \right)}},$$




where f(x11) is the learned function for the measurement for anomaly event E11, and τ11 is the threshold for event E11. The probability that anomaly event E11 did not occur is given by (1−P11).


Then, the probabilities are determined for second RNC level 348. The probability P21 that anomaly event E21 occurred is determined in node 324. This probability is given by:








$$P_{21} = \frac{1}{1 + e^{-\left( f(x_{21}) - \tau_{21} \right)}},$$




where f(x21) is the learned function for the measurement for anomaly event E21, and τ21 is the threshold for event E21. The probability that anomaly event E21 did not occur is given by (1−P21). Similarly, in node 326, the probability P22 that anomaly event E22 occurred is determined by:








$$P_{22} = \frac{1}{1 + e^{-\left( f(x_{22}) - \tau_{22} \right)}},$$




where f(x22) is the learned function for the measurement for anomaly event E22, and τ22 is the threshold for event E22. The probability that anomaly event E22 did not occur is given by (1−P22).


Likewise, the probabilities for the cell level anomalies are determined. The probability P31 that anomaly event E31 occurred is determined in node 328. This probability is given by:








$$P_{31} = \frac{1}{1 + e^{-\left( f(x_{31}) - \tau_{31} \right)}},$$




where f(x31) is the learned function for the measurement for anomaly event E31, and τ31 is the threshold for event E31. The probability that anomaly event E31 did not occur is given by (1−P31). Additionally, in node 330, the probability P32 that anomaly event E32 occurred is determined by:








$$P_{32} = \frac{1}{1 + e^{-\left( f(x_{32}) - \tau_{32} \right)}},$$




where f(x32) is the learned function for the measurement for anomaly event E32, and τ32 is the threshold for event E32. The probability that anomaly event E32 did not occur is given by (1−P32). Also, in node 334, the probability P33 that anomaly event E33 occurred is determined by:








$$P_{33} = \frac{1}{1 + e^{-\left( f(x_{33}) - \tau_{33} \right)}},$$




where f(x33) is the learned function for the measurement for anomaly event E33, and τ33 is the threshold for event E33. The probability that anomaly event E33 did not occur is given by (1−P33).


When the probabilities are calculated, the edge weight distance is determined for the leaves. For example, the edge weight distance for node 336, not a problem of type X, is given by:





−ln(1−P11)−ln(1−P21)−ln(1−P31).


Also, the edge weight distance for node 338, a cell only problem of type X, is given by:





−ln(1−P11)−ln(1−P21)−ln(P31).


Similarly, the edge weight distance for node 340, look at other anomaly events for the root cause, is given by:





−ln(1−P11)−ln(P21)−ln(1−P32).


Additionally, the edge weight distance for node 342, an RNC and cell problem of type Y, is given by:





−ln(1−P11)−ln(P21)−ln(P32).


The edge weight distance for node 332, an unknown problem, is given by:





−ln(P11)−ln(1−P22).


Also, the edge weight distance for node 344, look for other anomaly events for the root cause, is given by:





−ln(P11)−ln(P22)−ln(1−P33).


The edge weight distance for node 347, an RNC and cell problem of type Z, is given by:





−ln(P11)−ln(P22)−ln(P33).
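
The leaf distances for soft decision tree 320 can be computed together in one place. This is a sketch that assumes every node probability lies strictly between 0 and 1; the function name `leaf_distances` and the numeric probabilities are illustrative. Each leaf accumulates every edge on its path from the root, so the leaves below node E22 include the yes edge of E11.

```python
import math


def leaf_distances(p11, p21, p22, p31, p32, p33):
    """Path distance from the root to every leaf of the example soft tree."""

    def yes(p):
        return -math.log(p)  # edge weight on a yes branch

    def no(p):
        return -math.log(1.0 - p)  # edge weight on a no branch

    return {
        "not a problem of type X": no(p11) + no(p21) + no(p31),
        "cell only problem of type X": no(p11) + no(p21) + yes(p31),
        "look at other anomaly events (below E32)": no(p11) + yes(p21) + no(p32),
        "RNC and cell problem of type Y": no(p11) + yes(p21) + yes(p32),
        "unknown problem": yes(p11) + no(p22),
        "look for other anomaly events (below E33)": yes(p11) + yes(p22) + no(p33),
        "RNC and cell problem of type Z": yes(p11) + yes(p22) + yes(p33),
    }


# Made-up node probabilities for illustration only.
distances = leaf_distances(p11=0.9, p21=0.5, p22=0.8, p31=0.5, p32=0.5, p33=0.7)
most_likely = min(distances, key=distances.get)
```

Converting each distance back with exp(−distance) yields the path probabilities, and because the leaves are mutually exclusive and exhaustive, those probabilities sum to one.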


Correlated events along a path may clarify path discovery by strengthening correlated events and suppressing anti-correlated events when conditional probabilities are used. This is tantamount to joint analysis. As an edge strengthens, and its probability approaches one, its complementary edge weakens, with a probability approaching zero. The multiplication of path edge probabilities causes paths with weak edges to disappear quickly. Spurious path outcomes are still possible with noisy signals. To prevent this, leaves that are uninteresting for anomaly detection may be removed upfront. A few shortest paths, or all the paths that are short enough, may be retained for reporting and analysis. Several root causes may be likely from an ambiguous ancestor in the tree.



FIG. 15 illustrates flowchart 370 for a method of determining the root cause of an anomaly. Initially, in step 372, a soft decision tree is created. In one example, the soft decision tree is created based on engineering experience and previous anomalies. The soft decision tree may be modified as new root causes of anomalies are observed. This may be done automatically or based on user input.


Next, in step 374, an anomaly is detected. For example, anomalies may be detected using a lower threshold and an upper threshold. An anomaly is detected when a metric is above the upper threshold. Also, an anomaly is detected when the metric stays between the lower threshold and the upper threshold for a delay length of time.
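The two-threshold rule with a delay window can be sketched as a small stateful detector. This is an illustrative sketch under stated assumptions: the class and field names are hypothetical, a metric exactly equal to a threshold is treated as lying between the thresholds, and the timer is left untouched when the metric exceeds the upper threshold, none of which the text specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TwoThresholdDetector:
    lower: float   # below this, the data point is not an anomaly
    upper: float   # above this, the data point is an anomaly
    delay: float   # delay window length, in the same units as the timestamps
    _window_start: Optional[float] = None  # time the delay timer was set, if any

    def update(self, t: float, metric: float) -> bool:
        """Return True when the sample observed at time t is declared an anomaly."""
        if metric > self.upper:
            return True
        if metric < self.lower:
            self._window_start = None  # release the delay window timer
            return False
        # Metric lies between the thresholds: a possible anomaly.
        if self._window_start is None:
            self._window_start = t     # set the delay window timer
        # Declare an anomaly once the metric has stayed in the band long enough.
        return (t - self._window_start) >= self.delay
```

For example, with `lower=1.0`, `upper=3.0`, and `delay=2.0`, a metric that enters the band at t=1 and stays there is reported as an anomaly at t=3.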


Then, in step 376, the probability that an anomaly has occurred is determined. Initially, the probability that the root anomaly occurred is determined. The probability that anomaly Eij occurred is given by:

$$P_{ij} = \frac{1}{1 + e^{-(f(x_{ij}) - \tau_{ij})}}.$$
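A minimal Python sketch of this probability, assuming the expression is the logistic function 1/(1 + e^(−(f(x_ij) − τ_ij))). The function name `anomaly_probability` and the identity default for f are assumptions made for illustration; the text does not fix a particular f here.

```python
import math


def anomaly_probability(x: float, tau: float, f=lambda v: v) -> float:
    """Logistic probability P = 1 / (1 + exp(-(f(x) - tau))).

    f defaults to the identity, which is an assumption for this sketch.
    """
    z = f(x) - tau
    # Split on the sign of z so that math.exp never receives a large positive
    # argument, which would overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

The probability is 0.5 when f(x) equals the threshold τ, approaches one as f(x) rises above it, and approaches zero as f(x) falls below it.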





Next, in step 378, the system proceeds to the next level. All the nodes at the next level are examined.


In step 380, the system determines whether the first node is a leaf. When the first node is not a leaf, the system returns to step 376 to determine the probability that an anomaly event occurred for this node. Then, the system goes to step 378 and proceeds to the next level of the tree to examine the child nodes of the first node. When the first node is a leaf, the system proceeds to step 384 to calculate the edge distance of the root causes for the first node. The edge distance for a node representing a root cause is given by the sum of the negative natural logarithms of the edge probabilities along the path to that node. The edge distances of all the paths are calculated. The root cause with the shortest edge distance, or several root causes, may be selected for further examination.


After step 384, the system proceeds to step 382 to determine whether the second node is a leaf. When the second node is not a leaf, the system returns to step 376 to determine the probabilities for the second node. Then, the system proceeds to step 378 to move on to the children of the second node. When the second node is a leaf, the edge distances of the root causes are determined in step 386. As in step 384, the edge distance for a node representing a root cause is given by the sum of the negative natural logarithms of the edge probabilities along the path to that node.


The system traverses all branches, so it traverses the entire tree. The edge distances of all the paths are calculated. The root cause with the shortest edge distance, or several root causes, may be selected for further examination.
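The full traversal can be sketched as a recursion over a binary tree. This is a hedged illustration rather than the flowchart's exact control flow: the `Node` class, its field names, and the example tree are hypothetical, each decision node is assumed to have exactly two children and a probability strictly between zero and one, and leaf names are assumed to be unique.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    name: str
    p: Optional[float] = None     # probability that this node's anomaly event occurred
    yes: Optional["Node"] = None  # child followed when the event occurred
    no: Optional["Node"] = None   # child followed when it did not

    @property
    def is_leaf(self) -> bool:
        return self.yes is None and self.no is None


def leaf_distances(node: Node, dist: float = 0.0) -> Dict[str, float]:
    """Visit every branch and return {root-cause name: edge distance}."""
    if node.is_leaf:
        return {node.name: dist}
    out: Dict[str, float] = {}
    # Each edge adds the negative natural log of its probability.
    out.update(leaf_distances(node.yes, dist - math.log(node.p)))
    out.update(leaf_distances(node.no, dist - math.log(1.0 - node.p)))
    return out


# Hypothetical two-level tree with illustrative probabilities.
root = Node(
    "E11", 0.2,
    yes=Node("E22", 0.4, yes=Node("cause C"), no=Node("unknown")),
    no=Node("not a problem"),
)
distances = leaf_distances(root)
best = min(distances, key=distances.get)  # leaf with the shortest edge distance
```

Sorting `distances` by value instead of taking the minimum yields the several shortest paths when more than one candidate root cause is to be reported.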



FIG. 16 illustrates a block diagram of processing system 270 that may be used for implementing the devices and methods disclosed herein. Specific devices may utilize all of the components shown, or only a subset of the components, and levels of integration may vary from device to device. Furthermore, a device may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc. The processing system may comprise a processing unit equipped with one or more input devices, such as a microphone, mouse, touchscreen, keypad, keyboard, and the like. Also, processing system 270 may be equipped with one or more output devices, such as a speaker, a printer, a display, and the like. The processing unit may include central processing unit (CPU) 274, memory 276, mass storage device 278, video adapter 280, and I/O interface 288 connected to a bus.


The bus may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, video bus, or the like. CPU 274 may comprise any type of electronic data processor. Memory 276 may comprise any type of system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, the memory may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.


Mass storage device 278 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus. Mass storage device 278 may comprise, for example, one or more of a solid state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.


Video adapter 280 and I/O interface 288 provide interfaces to couple external input and output devices to the processing unit. As illustrated, examples of input and output devices include the display coupled to the video adapter and the mouse/keyboard/printer coupled to the I/O interface. Other devices may be coupled to the processing unit, and additional or fewer interface cards may be utilized. For example, a serial interface card (not pictured) may be used to provide a serial interface for a printer.


The processing unit also includes one or more network interfaces 284, which may comprise wired links, such as an Ethernet cable or the like, and/or wireless links to access nodes or different networks. Network interface 284 allows the processing unit to communicate with remote units via the networks. For example, the network interface may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the processing unit is coupled to a local-area network or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, remote storage facilities, or the like.


While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods might be embodied in many other specific forms without departing from the spirit or scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.


In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled or directly coupled or communicating with each other may be indirectly coupled or communicating through some interface, device, or intermediate component whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the spirit and scope disclosed herein.

Claims
  • 1. A method of determining whether a metric is an anomaly, the method comprising: receiving a data point; determining the metric in accordance with the data point and a center value; determining whether the metric is the anomaly by: determining that the data point is not the anomaly when the metric is below a lower threshold; determining that the data point is the anomaly when the metric is above an upper threshold; and determining that the data point might be the anomaly when the metric is between the lower threshold and the upper threshold.
  • 2. The method of claim 1, wherein determining that the data point might be the anomaly comprises: determining a first length of time that a plurality of data points comprising the data point have been between the lower threshold and the upper threshold; comparing the first length of time to a second length of time of a delay window; and determining that the data point is the anomaly when the first length of time is greater than or equal to the second length of time of the delay window.
  • 3. The method of claim 2, wherein determining that the data point might be the anomaly further comprises: determining whether a delay window timer has been set; and setting the delay window timer when the delay window timer has not been set.
  • 4. The method of claim 2, further comprising releasing a delay window timer when the metric is below the lower threshold.
  • 5. The method of claim 1, wherein determining the metric comprises determining a Mahalanobis distance between the data point and the center value.
  • 6. The method of claim 1, further comprising: determining the lower threshold; and determining the upper threshold.
  • 7. The method of claim 6, further comprising receiving a sensitivity level, wherein determining the lower threshold comprises determining the lower threshold in accordance with the sensitivity level, and wherein determining the upper threshold comprises determining the upper threshold in accordance with the sensitivity level.
  • 8. The method of claim 6, wherein determining the upper threshold comprises: receiving a user input; and setting the upper threshold to the user input.
  • 9. The method of claim 1, further comprising determining a probabilistic root cause of the anomaly when the data point is determined to be the anomaly.
  • 10. The method of claim 9, wherein determining the probabilistic root cause of the anomaly comprises traversing a soft decision tree.
  • 11. A method of root cause analysis, the method comprising traversing a soft decision tree, wherein the soft decision tree comprises a plurality of decision nodes and a plurality of root cause nodes, wherein traversing the soft decision tree comprises: determining a first plurality of probabilities that the plurality of decision nodes indicate an event which is an anomaly; and determining a second plurality of probabilities of the plurality of root causes in accordance with the first plurality of probabilities.
  • 12. The method of claim 11, wherein determining a first probability of the first plurality of probabilities comprises calculating the first probability to be
  • 13. The method of claim 11, wherein determining the second plurality of probabilities comprises determining a plurality of edge weights in accordance with first plurality of probabilities.
  • 14. The method of claim 13, wherein determining the plurality of edge weights comprises calculating negatives of the natural logarithm of the first plurality of probabilities and negatives of the natural logarithm of one minus the first plurality of probabilities.
  • 15. The method of claim 13, wherein determining the plurality of edge weights comprises determining a plurality of path distances in accordance with the plurality of edge weights.
  • 16. The method of claim 15, further comprising: determining the minimum of the plurality of path distances to produce a most likely root cause; and transmitting the most likely root cause.
  • 17. The method of claim 15, further comprising: determining which of the plurality of path distances are below a path threshold to produce a group of likely causes; and transmitting the group of likely causes.
  • 18. The method of claim 11, further comprising constructing the soft decision tree.
  • 19. The method of claim 18, wherein constructing the soft decision tree comprises constructing the soft decision tree in accordance with user input.
  • 20. The method of claim 18, wherein constructing the soft decision tree comprises performing a machine learning algorithm on a plurality of labels.
  • 21. The method of claim 11, further comprising detecting an initial anomaly.
  • 22. The method of claim 11, wherein determining the second plurality of probabilities of the plurality of root causes comprises determining a probability of a first root cause comprises multiplying a subset of the first plurality of probabilities to determine the probability of the first root cause, wherein the plurality of root causes comprises the first root cause, wherein the subset of the first plurality of probabilities are on a path along the soft decision tree from a root level of the soft decision tree to the first root cause.
  • 23. A computer for detecting an anomaly comprising: a processor; and a computer readable storage medium storing programming for execution by the processor, the programming including instructions to receive a data point, determine a metric in accordance with the data point and a center value, determine whether the metric is less than a lower threshold, between the lower threshold and an upper threshold, or greater than the upper threshold, determine that the data point is not the anomaly when the metric is less than the lower threshold, determine that the data point is the anomaly when the metric is greater than the upper threshold, and determine that the data point might be the anomaly when the metric is between the lower threshold and the upper threshold.