The present invention relates to a system and method for wireless communications, and, in particular, to a system and method for anomaly detection.
In network elements of a radio access network, such as base stations (or NodeBs or eNodeBs or cells) or radio network controllers (RNCs) of a cellular system, anomalies occur every now and then. Examples of anomalies include cell outage (e.g., sleeping cell), which may be indicated by key performance indicators (KPIs) with unusually poor (low or high) values. Anomalies may also occur in the form of unusual or broken relationships or correlations observed between sets of variables. It is desirable for anomalies to be rapidly detected while minimizing false alarms.
An anomaly has a root cause, such as a malfunctioning user equipment (UE) or network element, interference, or resource congestion from heavy traffic. In particular, the bottleneck may be downlink power, uplink received total wideband power, downlink bandwidth (codes or resource blocks), uplink bandwidth (resource blocks), backhaul bandwidth, channel elements (CE), control channel resources, etc. It is desirable to determine the root cause of an anomaly.
An embodiment method of determining whether a metric is an anomaly includes receiving a data point and determining a metric in accordance with the data point and a center value. The method also includes determining whether the metric is below a lower threshold, between the lower threshold and an upper threshold, or above the upper threshold and determining that the data point is not the anomaly when the metric is below the lower threshold. Additionally, the method includes determining that the data point is the anomaly when the metric is above the upper threshold and determining that the data point might be the anomaly when the metric is between the lower threshold and the upper threshold.
An embodiment method of root cause analysis includes traversing a soft decision tree, where the soft decision tree includes a plurality of decision nodes and a plurality of root cause nodes. Traversing the soft decision tree includes determining a first plurality of probabilities that the plurality of decision nodes indicate an event which is an anomaly and determining a second plurality of probabilities of the plurality of root causes in accordance with the first plurality of probabilities.
An embodiment computer for detecting an anomaly includes a processor and a computer readable storage medium storing programming for execution by the processor. The programming includes instructions to receive a data point and determine a metric in accordance with the data point and a center value. The programming also includes instructions to determine whether the metric is less than a lower threshold, between the lower threshold and an upper threshold, or greater than the upper threshold and determine that the data point is not the anomaly when the metric is less than the lower threshold. Additionally, the programming includes instructions to determine that the data point is the anomaly when the metric is greater than the upper threshold and determine that the data point might be the anomaly when the metric is between the lower threshold and the upper threshold.
The foregoing has outlined rather broadly the features of an embodiment of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of embodiments of the invention will be described hereinafter, which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiments disclosed may be readily utilized as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.
For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:
Corresponding numerals and symbols in the different figures generally refer to corresponding parts unless otherwise indicated. The figures are drawn to clearly illustrate the relevant aspects of the embodiments and are not necessarily drawn to scale.
It should be understood at the outset that although an illustrative implementation of one or more embodiments is provided below, the disclosed systems and/or methods may be implemented using any number of techniques, whether currently known or in existence. The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.
An embodiment method detects anomalies and the root causes of the anomalies. Examples of root causes of anomalies include a malfunctioning user equipment (UE) or network element, interference, and resource congestion from heavy traffic. In particular, the bottleneck may be downlink power, uplink received total wideband power, downlink bandwidth (codes or resource blocks), uplink bandwidth (resource blocks), backhaul bandwidth, channel elements (CE), control channel resources, etc. It is desirable to detect and determine the root cause of an anomaly.
Some anomaly detection methods select thresholds on variables or distance metrics, yielding a decision boundary based on the structure of training data, to determine that outliers represent anomalies. However, selection of the threshold often trades off false alarms against missed anomalies and detection time. An embodiment method uses two threshold levels to detect anomalies. When the data point is below the lower threshold, it is determined to not indicate an anomaly. When the data point is above the upper threshold, it is determined to indicate an anomaly. When the data point is between the thresholds, the history is used to determine whether the data point indicates an anomaly.
The root cause of a detected anomaly is determined in an embodiment method. A hard decision tree may be used. The decision tree may be created by an expert or learned from the data. However, hard decision trees may lead to an unknown diagnosis of the root cause or to a misdiagnosis. An embodiment method uses a soft decision tree to determine one or more likely root causes of an anomaly by mapping each metric into a probability via the logistic function. Then, the probabilities are multiplied together, invoking the naive Bayes assumption of independent attributes. As a result, the most likely root cause or top few likely root causes are determined, along with a probability or confidence measure for each cause.
Probability density functions may be used to determine whether a data point is likely to be an anomaly. Data points close to the center are likely to not indicate anomalies, while data points in the tails are likely to indicate anomalies. The center may be the mean, median, or another value indicating the center of the expected values.
and the variance is given by:
σ2=(1/m)Σi(xi−μ)2,
where μ is the mean of the m training data points xi.
In block 116, a detection algorithm detects anomalies based on new observations 118 using the modeled data.
Abnormal validation data points are used to set the threshold on the probability of the data being normal. A new input xi is predicted to be anomalous if:
p(xi)<ε,
where ε is a threshold learned, for example, from historical data.
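The univariate detection described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the training values, the threshold ε, and the helper names are hypothetical.

```python
import math

def fit_gaussian(train):
    """Estimate the mean and variance from normal (non-anomalous) training data."""
    m = len(train)
    mu = sum(train) / m
    var = sum((x - mu) ** 2 for x in train) / m
    return mu, var

def density(x, mu, var):
    """Univariate Gaussian probability density p(x)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def is_anomalous(x, mu, var, eps):
    """Predict an anomaly when p(x) < eps."""
    return density(x, mu, var) < eps

# Hypothetical KPI samples centered near 10; a far-out observation is flagged.
mu, var = fit_gaussian([9.8, 10.1, 10.0, 9.9, 10.2])
print(is_anomalous(10.0, mu, var, eps=1e-4))  # False: near the center
print(is_anomalous(25.0, mu, var, eps=1e-4))  # True: deep in the tail
```

In practice, ε would be learned from validation data containing known anomalies, as described above.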
Also, the covariance matrix is given by:
Σ=(1/m)Σi(xi−μ)(xi−μ)T,
where μ is the mean vector of the m training data points xi.
A detection algorithm in block 166 detects anomalies from new observations from block 168 using modeled data.
The covariance matrix determines the shape of the density contours. Abnormal validation datasets are used to determine ε, the threshold. A new input xi is predicted to be anomalous when:
p(xi)<ε.
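The multivariate case can be sketched in the same way, using the Mahalanobis distance mentioned later in this description. This is a hedged illustration restricted to two dimensions for brevity; the identity covariance and the threshold value are assumptions.

```python
import math

def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance (x − mu)^T Σ^-1 (x − mu) for 2-D vectors."""
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    return (d0 * (cov_inv[0][0] * d0 + cov_inv[0][1] * d1)
            + d1 * (cov_inv[1][0] * d0 + cov_inv[1][1] * d1))

def density(x, mu, cov, cov_inv):
    """2-D multivariate Gaussian probability density p(x)."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return math.exp(-0.5 * mahalanobis_sq(x, mu, cov_inv)) / (2.0 * math.pi * math.sqrt(det))

# Identity covariance for simplicity; its inverse is also the identity.
mu = (0.0, 0.0)
cov = [[1.0, 0.0], [0.0, 1.0]]
cov_inv = [[1.0, 0.0], [0.0, 1.0]]
eps = 1e-4
print(density((0.2, -0.1), mu, cov, cov_inv) < eps)  # False: near the center
print(density((5.0, 5.0), mu, cov, cov_inv) < eps)   # True: far out in the tail
```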
An embodiment uses an inner band, a middle band, and an outer band of a probability density function to detect an anomaly.
The lower and upper thresholds are derived from historical or training data for the KPI sets. Highly fluctuating KPIs, for example high-variance or heavy-tailed KPIs such as packet switching (PS) throughput, have a wider distance between the lower and upper thresholds. The lower and upper thresholds are closer together for more stable KPIs, such as security mode command failures. In one example, a user selects the sensitivity. When the user increases the sensitivity, more anomalies are detected at the expense of more false positives. With a higher sensitivity, the donut region shrinks. A user may also select a sacrosanct KPI expectation. When a KPI passes above this absolute threshold, an alarm is raised regardless of the degree of deviation from normal. Thus, the user selects the upper threshold.
When a metric, for example the Mahalanobis distance of a vector of metrics, passes beyond the upper threshold into the outer band, an alarm is raised. When the metric passes within the lower threshold into the inner band, no alarm is raised, and the alarm is turned off if it was previously on. Also, the delay window timer is reset. When the observed metric enters the middle band, the donut region between the lower and upper thresholds, the delay window timer is set. If the observed metric is still in the middle band when the delay window timer expires, an alarm is raised. In one example, the delay window is a fixed value. Alternatively, the delay window depends on the trend of the observed metric. If the value continues to get worse, the alarm is raised earlier.
To minimize false alarms and missed detection of anomalies, a wider range between the lower and upper thresholds, yielding a larger middle band, may be used. The observations are more likely to stay between these bounds, which serve as safety guards. The alarm may be triggered based on the consistency and trend over the delay window. This takes more time, but produces fewer false alarms. Alarms may be more obvious, for example, at the cell level than at the RNC level, where the aggregate of cells is more stable.
Next, in step 244, a data point is received. The data point may be a value in a cellular system, or another system.
Then, in step 246, the system determines whether the data point is in the inner band, the middle band, or the outer band. When the data point is in the outer band, the system proceeds to step 246 and an alarm is raised. The alarm may trigger root cause analysis. The system may also return to step 244 to receive the next data point.
When the data point is in the inner band, the system proceeds to step 250. No alarm is raised, and the alarm is reset if it were previously raised. Also, the delay window timer is reset if it were previously set. The system then proceeds to step 244 to receive the next data point.
When the data point is in the middle band, the system proceeds to step 254. In step 254, the system determines whether the delay window timer has previously been set. When the delay window timer has not previously been set, the system is just entering the middle band, and proceeds to step 256.
In step 256, the system sets the delay window timer. Then, it proceeds to step 244 to receive the next data point.
When the delay window timer has been previously set, the system proceeds to step 258, where it determines whether the delay window timer has expired. When the delay window timer has not expired, the system proceeds to step 244 to receive the next data point. When the delay window timer has expired, the system proceeds to step 246 to raise an alarm. The system may also consider other factors in deciding to set the alarm, such as the trend. When the data point is trending closer to the upper threshold, an alarm may be raised earlier.
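The three-band flow of steps 244 through 258 can be sketched as a simple state machine. This is a hedged illustration, not the claimed implementation: the threshold values are hypothetical, the delay window is measured in data points rather than wall-clock time, and the trend-based early alarm is omitted for brevity.

```python
INNER, MIDDLE, OUTER = "inner", "middle", "outer"

class BandDetector:
    """Three-band detector: no alarm in the inner band, immediate alarm in the
    outer band, and a delayed alarm if the metric lingers in the middle band."""

    def __init__(self, lower, upper, delay):
        self.lower = lower   # lower threshold (inner/middle boundary)
        self.upper = upper   # upper threshold (middle/outer boundary)
        self.delay = delay   # delay window, in number of data points
        self.timer = None    # delay window timer (None = not set)

    def band(self, metric):
        if metric < self.lower:
            return INNER
        if metric > self.upper:
            return OUTER
        return MIDDLE

    def observe(self, metric):
        """Process one data point; return True when an alarm is raised."""
        b = self.band(metric)
        if b == OUTER:
            return True                # beyond the upper threshold: raise alarm
        if b == INNER:
            self.timer = None          # reset the alarm and delay window timer
            return False
        if self.timer is None:         # just entered the middle band
            self.timer = self.delay
            return False
        self.timer -= 1
        return self.timer <= 0         # alarm if the delay window expired

det = BandDetector(lower=2.0, upper=5.0, delay=3)
print([det.observe(m) for m in [1.0, 3.0, 3.5, 3.2, 3.4]])
# → [False, False, False, False, True]: the metric lingered in the middle band
```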
After an anomaly is detected, it is desirable to determine the root cause of the anomaly.
Initially, in node 292, the system determines whether anomaly event E1 occurred. General metric set S1 is determined and compared to a threshold τ1. For example, when the metric is greater than the threshold τ1, anomaly event E1 occurred, and the system proceeds to node 296; when the metric is less than or equal to the threshold τ1, anomaly event E1 did not occur, and the system proceeds to node 294.
In node 296, the system determines whether anomaly event E22 occurred. Specific metric S22 is determined. Then, specific metric S22 is compared to a threshold τ22. When the metric is less than the threshold, anomaly event E22 did not occur, and the system proceeds to node 302, where it determines that the anomaly is an unknown problem. This may happen, for example, the first time this anomaly occurs. When the metric is greater than the threshold, anomaly event E22 occurred, and the system proceeds to node 304.
In node 304, the system determines whether anomaly event E33 occurred. Metric S33 is determined, and compared to a threshold τ33. When the metric is less than the threshold, anomaly event E33 has not occurred, and the system looks for other anomaly events to determine the root cause in node 314. On the other hand, when the metric is greater than the threshold, it is determined that anomaly event E33 has occurred. Then, in node 316, the root cause is determined to be RNC and cell problem type Z.
In node 294, the system determines whether anomaly event E21 occurred. Metric S21 is determined, and compared to a threshold τ21. When metric S21 is less than the threshold, it is determined that anomaly event E21 did not occur, and the system proceeds to node 298. On the other hand, when metric S21 is greater than or equal to the threshold, the system proceeds to node 300, determining that anomaly event E21 did occur.
In node 298, the system determines whether anomaly event E31 occurred. Metric S31 is determined, and compared to a threshold τ31. When metric S31 is less than the threshold, it is determined that anomaly event E31 did not occur, and the anomaly is not a problem of type X in node 306. When metric S31 is greater than or equal to the threshold, it is determined that the problem is a cell only problem of type X in node 308.
In node 300, the system determines whether anomaly event E32 occurred. Metric S32 is determined, and compared to a threshold τ32. When metric S32 is less than the threshold, it is determined that anomaly event E32 did not occur, and, in node 310, it is determined to look at other anomaly events for the root cause. When metric S32 is greater than or equal to the threshold, it is determined that the problem is an RNC and cell problem of type Y in node 312.
Decision tree 290 is a binary tree, but a non-binary tree may be used. For example, there may be a joint analysis of two events A and B with four mutually exclusive leaves: A and B, A and not B, not A and B, and not A and not B. If A and B arise from different components, then the respective probabilities are multiplied. Likewise, there may be eight leaves for three potential events, and so on.
The probability Pij that anomaly event Eij occurred is given by:
Pij=1/(1+exp(−(f(xij)−τij))),
where f(xij) is the learned function, for example the Mahalanobis distance, its argument xij is the test vector, and τij is the threshold.
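The logistic mapping from a learned metric and threshold to a probability can be sketched as follows. The scale parameter is an assumption, since the description does not specify how sharply the metric is converted into a probability.

```python
import math

def event_probability(fx, tau, scale=1.0):
    """Map a learned metric f(x) and threshold tau to the probability that the
    anomaly event occurred, via the logistic function. The scale parameter
    (an assumption here) controls how soft the decision boundary is."""
    return 1.0 / (1.0 + math.exp(-scale * (fx - tau)))

print(event_probability(5.0, 5.0))        # 0.5 exactly at the threshold
print(event_probability(9.0, 5.0) > 0.9)  # True: well above the threshold
```

A metric exactly at the threshold maps to probability 0.5; metrics far above or below the threshold approach 1 or 0, which is what makes the decision soft rather than hard.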
Given nodes ij and kl, corresponding to events Eij and Ekl, respectively, the edge weight between the nodes, if it exists on the decision tree, is denoted by (ij, kl). The probabilities are converted to a distance via a transform, so that the edge weight is given by:
Dij,kl=−ln(Pij)
when (ij, kl) exists on a yes branch, and by:
Dij,kl=−ln(1−Pij)
when (ij, kl) exists on a no branch. The edge weight is ∞ otherwise. Dij,kl decreases as Pij increases for a yes edge and as (1−Pij) increases for a no edge. To find the final likelihood of a root cause in a leaf node, Dij,kl is summed along the path from the root to that leaf. The shortest distance path from the root node to one of the leaf nodes indicates the most likely root cause. Several likely root causes may be considered, along with their likelihoods. For example, all root causes with a distance below a threshold may be considered. Alternatively, the paths with the two, three, or more smallest distances are considered. The most likely path or set of events is the argument of the minimum (arg min) of the path distance from the root to the leaves, that is:
Σ(ij,kl)∈path Dij,kl.
Under the naïve Bayes assumption, the additive distances along the edges of a candidate path lead to multiplicative probabilities for independent events Eij along the path.
Soft decision tree 320 has a root level, first RNC level 346; two intermediate levels, second RNC level 348 and cell level 350; and a root cause level, leaves 352. Initially, in node 322, the system determines the probability P11 that anomaly event E11 occurred. This probability is given by:
P11=1/(1+exp(−(f(x11)−τ11))),
where f(x11) is the learned function for the measurement for anomaly event E11, and τ11 is the threshold for event E11. The probability that anomaly event E11 did not occur is given by (1−P11).
Then, the probabilities are determined for second RNC level 348. The probability P21 that anomaly event E21 occurred is determined in node 324. This probability is given by:
P21=1/(1+exp(−(f(x21)−τ21))),
where f(x21) is the learned function for the measurement for anomaly event E21, and τ21 is the threshold for event E21. The probability that anomaly event E21 did not occur is given by (1−P21). Similarly, in node 326, the probability P22 that anomaly event E22 occurred is determined by:
P22=1/(1+exp(−(f(x22)−τ22))),
where f(x22) is the learned function for the measurement for anomaly event E22, and τ22 is the threshold for event E22. The probability that anomaly event E22 did not occur is given by (1−P22).
Likewise, the probabilities for the cell level anomalies are determined. The probability P31 that anomaly event E31 occurred is determined in node 328. This probability is given by:
P31=1/(1+exp(−(f(x31)−τ31))),
where f(x31) is the learned function for the measurement for anomaly event E31, and τ31 is the threshold for event E31. The probability that anomaly event E31 did not occur is given by (1−P31). Additionally, in node 330, the probability P32 that anomaly event E32 occurred is determined by:
P32=1/(1+exp(−(f(x32)−τ32))),
where f(x32) is the learned function for the measurement for anomaly event E32, and τ32 is the threshold for event E32. The probability that anomaly event E32 did not occur is given by (1−P32). Also, in node 334, the probability P33 that anomaly event E33 occurred is determined by:
P33=1/(1+exp(−(f(x33)−τ33))),
where f(x33) is the learned function for the measurement for anomaly event E33, and τ33 is the threshold for event E33. The probability that anomaly event E33 did not occur is given by (1−P33).
When the probabilities are calculated, the edge weight distance is determined for the leaves. For example, the edge weight distance for node 336, not a problem of type X, is given by:
−ln(1−P11)−ln(1−P21)−ln(1−P31).
Also, the edge weight distance for node 338, a cell only problem of type X, is given by:
−ln(1−P11)−ln(1−P21)−ln(P31).
Similarly, the edge weight distance for node 340, look at other anomaly events for the root cause, is given by:
−ln(1−P11)−ln(P21)−ln(1−P32).
Additionally, the edge weight distance for node 342, an RNC and cell problem of type Y, is given by:
−ln(1−P11)−ln(P21)−ln(P32).
The edge weight distance for node 332, an unknown problem, is given by:
−ln(P11)−ln(1−P22).
Also, the edge weight distance for node 344, look for other anomaly events for the root cause, is given by:
−ln(P11)−ln(P22)−ln(1−P33).
The edge weight distance for node 347, an RNC and cell problem of type Z, is given by:
−ln(P11)−ln(P22)−ln(P33).
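The leaf distances above can be computed and compared as sketched below. The probability values are hypothetical, and the label strings are shorthand for the leaf nodes; yes-edges contribute −ln(P) and no-edges contribute −ln(1−P), so the smallest total identifies the most likely root cause.

```python
import math

def nll(p):
    """Edge weight -ln(p) for a traversed edge with probability p."""
    return -math.log(p)

def leaf_distances(p11, p21, p22, p31, p32, p33):
    """Sum the edge weights along each root-to-leaf path of the example tree."""
    return {
        "not a problem of type X":              nll(1 - p11) + nll(1 - p21) + nll(1 - p31),
        "cell only problem of type X":          nll(1 - p11) + nll(1 - p21) + nll(p31),
        "other anomaly events (via E21)":       nll(1 - p11) + nll(p21) + nll(1 - p32),
        "RNC and cell problem of type Y":       nll(1 - p11) + nll(p21) + nll(p32),
        "unknown problem":                      nll(p11) + nll(1 - p22),
        "other anomaly events (via E22)":       nll(p11) + nll(p22) + nll(1 - p33),
        "RNC and cell problem of type Z":       nll(p11) + nll(p22) + nll(p33),
    }

# Hypothetical probabilities: E11, E22, and E33 are all likely to have occurred.
dists = leaf_distances(p11=0.9, p21=0.2, p22=0.8, p31=0.1, p32=0.5, p33=0.7)
print(min(dists, key=dists.get))  # → RNC and cell problem of type Z
```

Several leaves may be retained, for example all leaves whose distance falls below a threshold, so that a ranked list of likely root causes is reported rather than a single diagnosis.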
Correlated events along a path may clarify path discovery by strengthening correlated events and suppressing anti-correlated events when conditional probabilities are used. This is tantamount to joint analysis. As an edge strengthens and its probability approaches one, its complementary edge weakens, with a probability approaching zero. The multiplication of path edge probabilities causes paths with weak edges to disappear quickly. Spurious path outcomes are still possible with noisy signals. To prevent this, leaves that are uninteresting for anomaly detection may be removed upfront. A few shortest paths, or all the paths that are short enough, may be retained for reporting and analysis. Several root causes may be likely from an ambiguous ancestor in the tree.
Next, in step 374, an anomaly is detected. For example, anomalies may be detected using a lower threshold and an upper threshold. An anomaly is detected when a metric is above the upper threshold. Also, an anomaly is detected when the metric stays between the lower threshold and the upper threshold for a delay length of time.
Then, in step 376, the probability that an anomaly event has occurred is determined. Initially, the probability that the root anomaly occurred is determined. The probability Pij that anomaly event Eij occurred is given by:
Pij=1/(1+exp(−(f(xij)−τij))),
where f(xij) is the learned function and τij is the threshold.
Next, in step 378, the system proceeds to the next level. All the nodes at the next level are examined.
In step 380, the system determines whether the first node is a leaf. When the first node is not a leaf, the system proceeds to step 376 to determine the probability that an anomaly event occurred for this node. Then, the system goes to step 378 and proceeds to the next level of the tree to examine the children of the first node. When the first node is a leaf, the system proceeds to step 384 to calculate the edge distance of the root cause for the first node. The edge distance for a node representing a root cause is given by the sum of the negative logs of the probabilities along that path. The edge distances of all the paths are calculated. The root cause with the shortest edge distance, or several likely root causes, may be selected for further examination.
After step 384, the system proceeds to step 382 to determine whether the second node is a leaf. When the second node is not a leaf, the system proceeds to step 376 to determine the probabilities for the second node. Then, the system proceeds to step 378 to proceed to the children of the second node. When the second node is a leaf, the edge distances of the root causes are calculated in step 386. The edge distance for a node representing a root cause is given by the sum of the negative logs of the probabilities along that path.
The system traverses all branches, so it traverses the entire tree. The edge distances of all the paths are calculated. The root cause with the shortest edge distance, or several root causes, may be selected for further examination.
The bus may be one or more of any type of several bus architectures, including a memory bus or memory controller, a peripheral bus, a video bus, or the like. CPU 274 may comprise any type of electronic data processor. Memory 276 may comprise any type of system memory, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), a combination thereof, or the like. In an embodiment, the memory may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.
Mass storage device 278 may comprise any type of storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus. Mass storage device 278 may comprise, for example, one or more of a solid state drive, hard disk drive, a magnetic disk drive, an optical disk drive, or the like.
Video adapter 280 and I/O interface 288 provide interfaces to couple external input and output devices to the processing unit. As illustrated, examples of input and output devices include the display coupled to the video adapter and the mouse/keyboard/printer coupled to the I/O interface. Other devices may be coupled to the processing unit, and additional or fewer interface cards may be utilized. For example, a serial interface card (not pictured) may be used to provide a serial interface for a printer.
The processing unit also includes one or more network interfaces 284, which may comprise wired links, such as an Ethernet cable or the like, and/or wireless links to access nodes or different networks. Network interface 284 allows the processing unit to communicate with remote units via the networks. For example, the network interface may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the processing unit is coupled to a local-area network or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, remote storage facilities, or the like.
While several embodiments have been provided in the present disclosure, it should be understood that the disclosed systems and methods might be embodied in many other specific forms without departing from the spirit or scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.
In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, modules, techniques, or methods without departing from the scope of the present disclosure. Other items shown or discussed as coupled or directly coupled or communicating with each other may be indirectly coupled or communicating through some interface, device, or intermediate component whether electrically, mechanically, or otherwise. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and could be made without departing from the spirit and scope disclosed herein.