The invention relates generally to managing network resources such as in a wireless network and, more specifically but not exclusively, to analyzing attribute change impact within a managed network.
The rapid penetration of smart phones has put tremendous stress on mobile networks resulting in users experiencing poor application performance. Mobile network operators need to understand the root causes of poor network performance so they can take remedial action.
Presently, network operators use one or more of key performance indicators (KPIs) and key quality indicators (KQIs), which may be constructed using event counter data associated with network equipment, protocols, subscribers, applications and the like. For example, Universal Mobile Telecommunications System (UMTS) contemplates the use of thousands of UMTS Terrestrial Radio Access Network (UTRAN) event counters. These counters aggregate radio network information such as handoff events, paging events, physical transmission powers and the like for a fixed time interval. However, the specific impact to performance metrics indicated by event counters is largely unknown.
Various deficiencies of the prior art are addressed by the present invention of a system, method and apparatus for correlating event counter data with cell level Transmission Control Protocol (TCP) performance data.
Various embodiments contemplate a method and system for identifying causes of performance metric changes in a network by selecting, from a pool of network event counters, a plurality of candidate counters relevant to a performance metric; grouping candidate counters into clusters of similar counters; selecting, from each cluster, one or more representative counters; and fitting the selected representative counters to a model of the performance metric to determine thereby a set of representative counters most relevant to the performance metric.
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
Embodiments of the invention will be primarily described within the context of a network management system (NMS) adapted to manage event counter data associated with a Long Term Evolution (LTE) network such as event counter data associated with network elements, communications links, subnets, protocols, services, applications, layers and any other element, object or portion thereof within an LTE network. However, those skilled in the art and informed by the teachings herein will realize that the various embodiments are also applicable to other types of wireless networks (e.g., 2G networks, 3G networks, WiMAX, etc.), wireline networks or combinations of wireless and wireline networks. Thus, the network elements, links, connectors, sites and other objects representing mobile services may identify network elements associated with other types of wireless and wireline networks.
Various embodiments are adapted to identify one or more root causes of recurring user performance problems by correlating UTRAN event counters (EC) with performance metrics such as loss, delay and throughput monitored by a network monitor.
The approximately three thousand (3000) UTRAN event counters taken together provide detailed information on the operating conditions of the network, though not all counters will be associated with identifiable root causes. For example, some important metrics such as Nack.rate, Discard.rate, AirintTput and the like may be strongly correlated to network performance, yet not directly associated with degraded performance root causes.
The following are possible categories of root causes: power budget, signaling overload, Code Division Multiple Access (CDMA) code availability, downlink/uplink Signal to Noise Ratio (SNR), backhaul congestion, handoff/cell selection, cell overload, and the like. It should be noted that some counters are highly correlated, and so each category of root cause may be reflected in many counters, though other counters are not well correlated and, therefore, are not as well reflected in various root cause categories.
The exemplary UEs 102 are wireless user devices capable of accessing a wireless network, such as LTE network 110. The UEs 102 are capable of supporting control signaling in support of the bearer session(s). The UEs 102 may be mobile phones, personal digital assistants (PDAs), computers, tablet devices or any other wireless user device.
The exemplary LTE network 110 includes, illustratively, two eNodeBs 111-1 and 111-2 (collectively, eNodeBs 111), two Serving Gateways (SGWs) 112-1 and 112-2 (collectively, SGWs 112), a Packet Data Network (PDN) Gateway (PGW) 113, a Mobility Management Entity (MME) 114, and a Policy and Charging Rules Function (PCRF) 115. The eNodeBs 111 provide a radio access interface for UEs 102. The SGWs 112, PGW 113, MME 114, and PCRF 115, as well as other components which have been omitted for purposes of clarity, cooperate to provide an Evolved Packet Core (EPC) network supporting end-to-end service delivery using Internet Protocol (IP).
The eNodeBs 111 support communications for UEs 102. As depicted in
The SGWs 112 support communications for eNodeBs 111 using, illustratively, respective S1-u interfaces between the SGWs 112 and the eNodeBs 111. The S1-u interfaces support per-bearer user plane tunneling and inter-eNodeB path switching during handover.
As depicted in
The PGW 113 supports communications for the SGWs 112 using, illustratively, respective S5/S8 interfaces between PGW 113 and SGWs 112. The S5 interfaces provide functions such as user plane tunneling and tunnel management for communications between PGW 113 and SGWs 112, SGW relocation due to UE mobility, and the like. The S8 interfaces, which may be Public Land Mobile Network (PLMN) variants of the S5 interfaces, provide inter-PLMN interfaces providing user and control plane connectivity between the SGW in the Visitor PLMN (VPLMN) and the PGW in the Home PLMN (HPLMN). The PGW 113 facilitates communications between LTE network 110 and IP networks 130 via an SGi interface.
The MME 114 provides mobility management functions in support of mobility of UEs 102. The MME 114 supports the eNodeBs 111 using, illustratively, respective S1-MME interfaces which provide control plane protocols for communication between the MME 114 and the eNodeBs 111.
The PCRF 115 provides dynamic management capabilities by which the service provider may manage rules related to services provided via LTE network 110 and rules related to charging for services provided via LTE network 110.
As depicted and described herein with respect to
The IP networks 130 include one or more packet data networks via which UEs 102 may access content, services, and the like.
The MS 140 provides management functions for managing the LTE network 110. The MS 140 may communicate with LTE network 110 in any suitable manner. In one embodiment, for example, MS 140 may communicate with LTE network 110 via a communication path 141 which does not traverse IP networks 130. In one embodiment, for example, MS 140 may communicate with LTE network 110 via a communication path 142 which is supported by IP networks 130. The communication paths 141 and 142 may be implemented using any suitable communications capabilities. The MS 140 may be implemented as a general purpose computing device or specific purpose computing device, such as described below with respect to
The processor(s) 210 is adapted to cooperate with the memory 220, the network interface 230N and the user interface 230I to provide various management functions for LTE network 110.
The memory 220, generally speaking, stores programs, data, tools and the like that are adapted for use by the processor(s) 210 and other modules to provide the various functions described herein. The memory includes a Discovery and Management Engine (DME) 221, a Discovery and Management Database (DMD) 222, a Performance Processing Engine (PPE) 225, a Performance Processing Database (PPD) 226 and various other functions 228.
The DMD 222 and PPD 226 store data which may be generated by and used by various ones and/or combinations of the engines, functions and tools of memory 220. The DMD 222 and PPD 226 may be combined into a single database or implemented as respective databases, memory structures and/or portions thereof. Either of the combined or respective databases may be implemented as single databases or multiple databases in any of the arrangements known to those skilled in the art.
Although depicted and described with respect to an embodiment in which each of the engines and databases are stored within memory 220, it will be appreciated by those skilled in the art that the engines and databases may be stored in one or more other storage devices internal to MS 140 and/or external to MS 140. The engines and databases may be distributed across any suitable numbers and/or types of storage devices internal and/or external to MS 140. The memory 220, including each of the engines and/or databases of memory 220, is described in additional detail herein below.
The network interface 230N is adapted to facilitate communications with LTE network 110. The user interface 230I is adapted to facilitate communications with one or more user workstations, illustratively user workstation 250 including graphical user interface (GUI) 255, for enabling one or more users to perform management functions for LTE network 110, such as at a network operations center (NOC) or at a remote location.
Discovery and Management Engine
The discovery and management engine (DME) 221 is generally adapted for providing network discovery functions and management functions associated with the LTE network 110. Generally speaking, the DME performs a discovery process in which configuration information, status/operating information and connection information regarding the elements and sub-elements forming the network is gathered, retrieved, inferred and/or generated, as well as a management process in which the various nodes, links and so on forming the network 110 are managed in accordance with the business requirements of the network operator and customers. Data used within the context of the discovery and management functions is stored in, illustratively, discovery and management database 222.
Performance Processing Engine
The performance processing engine (PPE) 225 is generally adapted for providing performance management functions in accordance with the various embodiments. For example, the PPE 225 may be adapted to identify the root causes of performance deficiencies using various types of data received by the discovery and management engine 221 (possibly stored in the discovery and management database 222). For example, in various embodiments, network event counters, alarms, warnings, status updates and the like are aggregated and utilized by the discovery and management engine 221. In various embodiments, the PPE 225 interacts with the DME 221 to process some or all of this data with a view toward identifying root causes of performance deficiencies in the network 110.
The PPE 225 may operate in response to a request from the DME 221 or in an independent or semiautonomous manner. In various embodiments, the DME 221 identifies one or more root causes associated with a specific performance deficiency. In various embodiments, DME 221 identifies one or more root causes associated with multiple performance deficiencies. In various embodiments, root causes associated with one or more performance deficiencies are prioritized in terms of network impact such that a network operator may correct the root causes in a prioritized or ordered manner.
Correlating TCP Performance with Cell Level Event Counters
Various embodiments operate to correlate cell level Transmission Control Protocol (TCP) performance data in terms of loss, throughput, delay and the like with cell level event counters. The large problem space associated with numerous cell level event counters is reduced by selectively filtering out less relevant event counters, clustering similar relevant event counters and selecting one or a few event counters per cluster for further processing using classification analysis and/or other techniques to identify root causes of performance deficiencies in the network.
At step 310, a plurality of candidate counters relevant to one or more performance metrics is selected from a pool of network event counters. Referring to box 315, the candidate counters may be selected using one or more of domain knowledge, importance score, minimum threshold level, rank correlation, Kolmogorov-Smirnov (KS) test or other mechanism, such as discussed in more detail below.
Generally speaking, step 310 operates to reduce the number of event counters to be processed by filtering out those that are less relevant to the performance metric of interest. In this manner, the use of processing, memory and other resources to process irrelevant or less relevant event counters is avoided. Optionally, candidate counters are normalized or otherwise transformed prior to processing to simplify that processing.
At step 320, similar candidate counters are grouped into clusters of counters, such as for each of one or more performance metrics of interest. Referring to box 325, similarity between counters may be identified using a number of techniques, including spectral clustering, cost tree analysis, pair-wise correlation of candidate counters and other techniques. For example, candidate counters whose mutual (pair-wise) correlation exceeds a first threshold level (e.g., 0.95) may be considered to be similar. Generally speaking, grouping is performed using statistical clustering techniques such as clustering based on a graphical representation of candidate counters (e.g., spectral clustering, connected components), hierarchical clustering, using pair-wise correlation of candidate counters as a similarity score, cost tree analysis and the like.
At step 330, one or more representative counters is selected from each cluster. Referring to box 335, one or more representative counters may be selected according to a largest correlation to a performance metric of interest, correlation above a second threshold level or some other selection criteria.
Generally speaking, steps 320-330 operate to further reduce the number of event counters to be processed by identifying groups of similar counters and selecting one or a few counters from each group, thereby avoiding the further processing of duplicate similar counters.
At step 340, the selected representative counters are fitted to one or more models of one or more performance metrics to determine thereby representative counters most relevant to the one or more performance metrics. In this manner, event counters indicative of fault conditions that are most relevant to performance metrics may be used as a proxy for such performance metrics or in conjunction with the management of such performance metrics by the network management system 140 or other entity associated with the network. In various embodiments, cell level TCP performance data such as loss, throughput, delay or other performance metrics is correlated with various cell level event counters in an efficient manner to improve the ability of network operators to quickly and efficiently address root causes of network problems.
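By way of illustration only, the following sketch outlines one possible software realization of steps 310 through 340, assuming Python with NumPy, SciPy and scikit-learn; the thresholds, function and variable names, the use of a Spearman rank-correlation screen, a correlation-graph clustering and a single decision tree are illustrative assumptions rather than the claimed implementation.

```python
# Hypothetical end-to-end sketch of steps 310-340; thresholds, names and model
# choices are illustrative assumptions, not the claimed implementation.
import numpy as np
from scipy.stats import spearmanr
from scipy.sparse.csgraph import connected_components
from sklearn.tree import DecisionTreeClassifier

def rank_event_counters(counters, metric, select_thresh=0.3, cluster_thresh=0.95):
    """counters: (n_intervals, n_counters) array; metric: (n_intervals,) array."""
    n_counters = counters.shape[1]

    # Step 310: keep counters whose |Spearman rank correlation| with the metric
    # exceeds a deliberately loose selection threshold.
    scores = np.array([abs(spearmanr(counters[:, j], metric)[0])
                       for j in range(n_counters)])
    candidates = np.where(scores >= select_thresh)[0]

    # Step 320: connect two candidates when their pair-wise |correlation| exceeds
    # the clustering threshold; connected components become clusters of
    # near-duplicate counters.
    corr = np.corrcoef(counters[:, candidates], rowvar=False)
    adjacency = (np.abs(corr) >= cluster_thresh).astype(int)
    np.fill_diagonal(adjacency, 0)
    _, labels = connected_components(adjacency, directed=False)

    # Step 330: from each cluster keep the counter best correlated with the metric.
    representatives = []
    for c in np.unique(labels):
        members = candidates[labels == c]
        representatives.append(members[np.argmax(scores[members])])

    # Step 340: fit a classification tree on high/low metric classes and rank the
    # surviving representatives by their importance in the fitted tree.
    lower_q, upper_q = np.percentile(metric, [25, 75])
    mask = (metric <= lower_q) | (metric >= upper_q)
    y = (metric[mask] >= upper_q).astype(int)            # 1 = high (e.g., high loss)
    X = counters[np.ix_(mask, representatives)]
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
    order = np.argsort(tree.feature_importances_)[::-1]
    return [int(representatives[i]) for i in order], tree.feature_importances_[order]
```

Alternative scoring (e.g., the KS test), grouping (e.g., spectral clustering) and models (e.g., boosted trees or regression) discussed herein may be substituted at the corresponding steps.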
Selection of Candidate Counters
For example, assume that a network operator concerned with one or more network performance metrics Y (e.g., packet loss, packet delay, throughput and the like) receives performance data associated with a plurality of UTRAN counters x. In various embodiments, a score is computed between each counter x and each performance metric Y indicating how important a particular counter x is to a particular performance metric Y. If the score is above a predefined correlation threshold level or meets other selection criteria, then the particular counter is selected for further analysis or processing with respect to at least the particular performance metric Y. A general goal of this step is to reduce the number of counters subjected to further processing. As such, the specific methods used to correlate counters x and performance metrics Y may be relatively permissive, allowing candidate counters to pass this initial filtering stage.
In one embodiment, a method for measuring the impact or importance of each event counter x with respect to each performance metric Y uses rank correlation, such as a Pearson correlation between the ranks of event counter(s) x and performance metric(s) Y. Rank correlation advantageously accommodates possible non-linearity in the dependence between x and Y.
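As a minimal sketch of such a rank-correlation screen on synthetic data (the threshold of 0.3 and the variable names are illustrative assumptions, not the disclosed implementation), the Spearman coefficient, i.e., the Pearson correlation of the ranks, may be computed as follows:

```python
# Minimal sketch of the rank-correlation screen on synthetic data; the threshold
# of 0.3 and the variable names are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
metric_y = rng.gamma(2.0, 1.0, size=500)                    # e.g., per-interval loss rate
counter_x = metric_y ** 2 + rng.normal(0.0, 1.0, size=500)  # nonlinearly related counter

rho, _ = spearmanr(counter_x, metric_y)   # Pearson correlation of the ranks
if abs(rho) >= 0.3:                       # deliberately loose screening threshold
    print(f"keep counter as candidate (|rho| = {abs(rho):.2f})")
```

Even though the counter depends on the metric nonlinearly, the rank correlation remains high, which is the property relied upon above.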
In another embodiment, a method for measuring the impact or importance of each event counter x with respect to each performance metric Y uses a Kolmogorov-Smirnov (KS) test. For example, for a performance metric Y, a computation is made to determine its upper and lower quartiles. If the observed value of Y is above the upper quartile, then it may be presumed to have a high value. Similarly, if the observed value of Y is below the lower quartile, then it may be presumed to have a low value. In one embodiment, a KS difference is then found between two conditional cumulative distribution curves P(X|high Y values) and P(X|low Y values). If x has little or no impact on Y, then these two conditional distributions should not differ much; if x has significant impact on Y, then these two conditional distributions should differ significantly.
The KS test is especially useful within the context of classification trees, as will be discussed in more detail below. Specifically, the KS test operates to eliminate the data points where the values of a performance metric Y are in a reasonable range, while focusing attention on the counters that differentiate the high and low values of the performance metric Y (e.g., loss).
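A minimal sketch of this quartile-conditioned KS screen, assuming Python with SciPy and synthetic data for illustration only, might proceed as follows:

```python
# Sketch of the quartile-conditioned KS screen on synthetic data; names are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
metric_y = rng.gamma(2.0, 1.0, size=2000)                    # performance metric Y
counter_x = metric_y + rng.normal(0.0, 0.5, size=2000)       # counter with some impact on Y

lower_q, upper_q = np.percentile(metric_y, [25, 75])
x_given_high_y = counter_x[metric_y > upper_q]               # samples of P(X | high Y values)
x_given_low_y = counter_x[metric_y < lower_q]                # samples of P(X | low Y values)

# KS statistic = maximum difference between the two conditional CDFs.
ks_stat, _ = ks_2samp(x_given_high_y, x_given_low_y)
print(f"KS difference = {ks_stat:.3f}")  # near 0 -> little impact; large -> strong impact
```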
Grouping of Similar Candidate Counters
There are many groups of, illustratively, UTRAN counters that may be used to represent histograms of various performance metrics, such as the following: VS.IrmcacDistributionRscp.N[val1]LeMeasLtN[val2], where [val1, val2] are used to represent non-overlapping data ranges. Event counters in such counter groups are related since they represent different parts of the histogram of the distribution.
As an example, let X be a metric with its histogram being represented by a vector counter group [x1, x2 . . . xm], where xi represents the frequency counts in interval Ii=[bi-1, bi], with b0≦b1≦ . . . ≦bk≦bk+1≦ . . . ≦bm, and [b0, bm] is the effective range of the counter. The two methods described above (rank correlation and the KS test) may be used within the context of the various embodiments, in a similar manner, to correlate such counters with one or more performance metrics Y.
One embodiment (rank correlation) contemplates correlating P(X<=bi) and Y, then finding the index i that maximizes the correlation, such that P(X<=bi) is a representative metric from the counter group selected for further analysis. Additional representative metrics may also be selected in various embodiments, such as the index values yielding the next-largest correlations.
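A possible sketch of this rank-correlation variant for a histogram counter group is shown below; the function name and the assumed (n_intervals, m) layout of the bucket counts are illustrative assumptions:

```python
# Sketch of the rank-correlation variant for a histogram counter group; the
# (n_intervals, m) layout of bucket_counts and the function name are assumptions.
import numpy as np
from scipy.stats import spearmanr

def best_cumulative_bucket(bucket_counts, metric_y):
    totals = bucket_counts.sum(axis=1, keepdims=True).astype(float)
    totals[totals == 0] = 1.0                                 # guard against empty intervals
    cum_fraction = np.cumsum(bucket_counts, axis=1) / totals  # per-interval P(X <= b_i)
    scores = np.array([abs(spearmanr(cum_fraction[:, i], metric_y)[0])
                       for i in range(cum_fraction.shape[1])])
    best_i = int(np.argmax(scores))
    return best_i, cum_fraction[:, best_i]                    # representative metric P(X <= b_best)
```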
One embodiment (using a KS score) contemplates finding a distribution of X using the set of counters for high/low values of Y, and then running a KS test to find the difference between the two cumulative distribution function (CDF) curves, illustratively normalized, for high loss and low loss respectively. The KS score is computed as the maximum difference between the two CDF curves. The location bi where the difference is greatest is identified, and its corresponding P(X<=bi) is used. In addition, the total frequency counts may also be computed for further analysis. As a result, only two counters remain for further correlation analysis.
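Similarly, a sketch of the KS-score variant, again under the assumed (n_intervals, m) bucket-count layout and with illustrative names, might be:

```python
# Sketch of the KS-score variant for a histogram counter group; layout and names
# as assumed above. The bucket of largest CDF gap and the total counts are the
# two quantities retained for further analysis.
import numpy as np

def ks_score_for_group(bucket_counts, metric_y):
    lower_q, upper_q = np.percentile(metric_y, [25, 75])
    high = bucket_counts[metric_y > upper_q].sum(axis=0)      # aggregate histogram, high-Y intervals
    low = bucket_counts[metric_y < lower_q].sum(axis=0)       # aggregate histogram, low-Y intervals

    cdf_high = np.cumsum(high) / max(high.sum(), 1)           # normalized conditional CDF (high)
    cdf_low = np.cumsum(low) / max(low.sum(), 1)              # normalized conditional CDF (low)

    gap = np.abs(cdf_high - cdf_low)
    best_i = int(np.argmax(gap))                              # bucket boundary b_i of largest gap
    total_counts = bucket_counts.sum(axis=1)                  # per-interval total frequency counts
    return gap[best_i], best_i, total_counts
```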
Various methodologies may be employed to eliminate highly similar or duplicated event counters for further correlation analysis with respect to one or more performance metrics Y. A spectral grouping may be performed to form clusters of these highly correlated counters by computing a correlation for every pair of counters and forming an edge between the pair if the absolute value of the correlation exceeds a threshold such as, illustratively, 0.95 (higher or lower thresholds may be selected).
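A minimal sketch of such correlation-graph grouping is shown below, here using connected components as a simple stand-in for the spectral grouping (spectral clustering over the same |correlation| affinity matrix would be an alternative); the 0.95 threshold and names are illustrative:

```python
# Sketch of correlation-graph grouping via connected components; the 0.95
# threshold is illustrative. Spectral clustering over the same |correlation|
# affinity matrix is an alternative grouping.
import numpy as np
from scipy.sparse.csgraph import connected_components

def cluster_similar_counters(counters, threshold=0.95):
    """counters: (n_intervals, n_counters) array; returns one cluster label per counter."""
    corr = np.corrcoef(counters, rowvar=False)
    adjacency = (np.abs(corr) >= threshold).astype(int)       # edge if |correlation| >= threshold
    np.fill_diagonal(adjacency, 0)                            # ignore self-correlation
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels
```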
Selection of Cluster Representative Counters
For each cluster, one or more counters having the largest correlation with Y are selected to be representative of the cluster or counter group. That is, the various embodiments group similar event counters with respect to one or more performance metrics, and then select one counter, or a relatively small number of counters, as representative of each counter group.
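A possible sketch of this representative selection, reusing cluster labels such as those produced by a grouping step like the one sketched above and rank correlation with Y as the score (all names illustrative), might be:

```python
# Sketch of per-cluster representative selection; assumes cluster labels such as
# those produced by the grouping sketch above, and uses rank correlation with Y.
import numpy as np
from scipy.stats import spearmanr

def pick_representatives(counters, labels, metric_y):
    scores = np.array([abs(spearmanr(counters[:, j], metric_y)[0])
                       for j in range(counters.shape[1])])
    representatives = []
    for cluster in np.unique(labels):
        members = np.where(labels == cluster)[0]
        representatives.append(int(members[np.argmax(scores[members])]))
    return representatives   # column indices of the per-cluster representative counters
```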
Model Fitting and Analysis
The representative counters of the various clusters or groups are then processed according to a model. In various embodiments the model may comprise a regression model, classification trees, regression trees and so on, depending upon the performance metric Y of interest. After fitting the representative data to the model, an analysis is performed to identify the event counters most closely associated with root causes of performance metric problems.
Classification/Regression Trees
As an example, assume that a performance metric of interest Y comprises a packet loss rate and that a network operator wishes to identify those event counters most related to packet loss rate. It is noted that loss rate (e.g., losses per time interval such as every 15 minutes) correlation modeling is preferred over individual loss modeling due to the discrete nature of individual loss events.
In various embodiments, classification trees and various modifications thereof are used to predict membership of event counters x in one or more classes of categorical dependent variable(s) representing performance metric(s) of interest Y. Various other statistical processing functions may also be employed within the context of the embodiments, such as Discriminant Analysis, Cluster Analysis, Nonparametric Statistics, Nonlinear Estimation and so on.
At step 410, an upper quartile of Y is computed and a lower quartile of Y is computed, to create two classes of Y, for which classification analysis is performed using selected event counter(s) x. Referring to box 415, observations associated with the computed upper quartile of Y are treated as a high loss class, while observations associated with the computed lower quartile of Y are treated as a low loss class. Other high/low classes/classifications may be utilized.
Step 410 is used within the context of the classification analysis embodiment. In the case of a regression tree embodiment, the division into two classes is not necessary since all existing data may be used. In particular, step 410 operates to define splits associated with the data suitable for use within the context of a classification tree. It should be noted that the upper quartile/lower quartile split defined herein may be adapted by those skilled in the art informed by the teachings of the present embodiments. For example, in one embodiment an upper third/lower third split is used. In other embodiments, an upper quintile/lower quintile split is used. Other data splits are contemplated by the inventors.
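A minimal sketch of the step-410 split is shown below; the 25th/75th percentiles and names are illustrative assumptions, and the percentiles may be changed (e.g., 33/67 or 20/80) to realize the other splits mentioned above:

```python
# Sketch of the step-410 split; the 25th/75th percentiles are illustrative and may
# be replaced (e.g., 33/67 or 20/80) to realize the other splits mentioned above.
import numpy as np

def quartile_classes(metric_y):
    lower_q, upper_q = np.percentile(metric_y, [25, 75])
    high_mask = metric_y >= upper_q                           # high loss class observations
    low_mask = metric_y <= lower_q                            # low loss class observations
    keep = high_mask | low_mask                               # middle half is set aside
    classes = high_mask[keep].astype(int)                     # 1 = high loss, 0 = low loss
    return keep, classes
```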
At step 420, a classification tree is built. Referring to box 425, optional boosting procedures may also be used within the context of building a classification tree. Such boosting procedures comprise, illustratively, the known ‘AdaBoost’ method developed by Freund and Schapire. As a byproduct of the boosting method, for each event counter x, an importance score may be computed with respect to a performance metric Y, which score may be used to arrange or order the event counters x within the context of the classification tree.
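By way of illustration, boosting with per-counter importance scores might be sketched as follows using scikit-learn's AdaBoostClassifier on synthetic data; this is a stand-in for the boosting procedure referenced above, not the disclosed implementation, and all names are illustrative:

```python
# Sketch of boosting with per-counter importance scores, using scikit-learn's
# AdaBoostClassifier as a stand-in; the synthetic data and names are illustrative.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(1500, 6))                                # six representative counters
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.5, 1500) > 0).astype(int)  # high/low loss class

model = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
for idx in np.argsort(model.feature_importances_)[::-1]:      # order counters by importance
    print(f"counter {idx}: importance {model.feature_importances_[idx]:.3f}")
```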
At step 430, the classification trees are analyzed to identify the most important or relevant event counters x with respect to a performance metric Y.
At step 440, an optional regression analysis may be performed.
Generally speaking, for classification analysis the various embodiments balance the probabilities of two cases by sampling the event counter data, splitting the data into two equal groups (e.g., training and testing) and then building a classification tree/decision tree.
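A possible sketch of this balanced classification workflow, assuming Python with scikit-learn and illustrative names, is shown below:

```python
# Sketch of the balanced classification workflow: equal-size class samples, an
# even train/test split, and a single decision tree; all names are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def build_loss_tree(X, y, random_state=0):
    """X: (n_intervals, n_representatives) counters; y: 1 = high loss, 0 = low loss."""
    rng = np.random.default_rng(random_state)
    high_idx = np.where(y == 1)[0]
    low_idx = np.where(y == 0)[0]
    n = min(len(high_idx), len(low_idx))                      # balance the two classes by sampling
    keep = np.concatenate([rng.choice(high_idx, n, replace=False),
                           rng.choice(low_idx, n, replace=False)])

    X_train, X_test, y_train, y_test = train_test_split(
        X[keep], y[keep], test_size=0.5, random_state=random_state, stratify=y[keep])
    tree = DecisionTreeClassifier(max_depth=4, random_state=random_state)
    tree.fit(X_train, y_train)
    return tree, tree.score(X_test, y_test)                   # held-out classification accuracy
```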
A sample set of event counter data associated with a number of cells in a wireless network used by the inventors and processed according to the embodiments is described herein.
A leaf 510 data split (e.g., 959/959) is evaluated by a counter VS.CARRRPwrSignalling.NbEvt<5938 to provide if true a leaf 512 and to provide if false a leaf 514.
The leaf 512 data split (e.g., 886/470) is evaluated against a counter VS.HsdpalubZeroCapacityAlloc.RabPslBHdspa.normalize with a threshold of 0.02788 to provide, if the counter value is at or above the threshold, a leaf 516 and, if below the threshold, a leaf 518.
The leaf 518 data split (e.g., 425/343) is evaluated against a counter VS.IrmcacDistributionRscp.N.ratio<0.4812 to provide if true a leaf 522 and to provide if false a leaf 520.
Referring to
By performing optional linear regression analysis on the various event counters and their impact on one or more performance metrics, additional characterizing data associated with the wireless network may be provided. In the case of the sample set of event counter data, 70% of the variance in the performance metric denoted as Nack.Rate is explained by the event counters identified as important to this performance metric. Thus, the various methodologies employed herein provide useful correlation of event counters to performance metrics of interest.
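By way of illustration only, the variance-explained figure of such a regression corresponds to the coefficient of determination (R²) of an ordinary linear fit, as sketched below; the data here are synthetic, and the 70% figure reported above comes from the inventors' sample data rather than from this sketch:

```python
# Sketch of variance-explained via ordinary linear regression; the data are synthetic
# and the 70% figure reported above comes from the inventors' sample data, not this code.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
important_counters = rng.normal(size=(1000, 4))               # counters deemed important
nack_rate = important_counters @ np.array([0.8, 0.5, 0.3, 0.1]) + rng.normal(0, 0.7, 1000)

model = LinearRegression().fit(important_counters, nack_rate)
r_squared = model.score(important_counters, nack_rate)        # coefficient of determination
print(f"variance explained: {r_squared:.0%}")
```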
Based upon the classification tree and importance plot depicted in
First, high handoff events cause high losses. An event counter denoted as VS.CARRPwrSignalling.NbEvt measures the number of link addition and deletion events. When this counter is larger than a threshold of 5938 events during a 15 min interval, high loss is likely: 489 out of the 959 high loss intervals crossed this threshold, while only 73 out of the 959 low loss intervals did so. This event counter is fifth from the top of the variable importance plot of
Second, low cell congestion typically means low loss. An event counter denoted as VS.HsdpalubZeroCapacityAlloc.RabPslBHsdpa.normalize measures cell congestion. Half of the low loss intervals exhibit a value of this counter below 0.02788. By contrast, only 10% of the high loss intervals exhibit a value below this threshold. This event counter is ninth from the top of the variable importance plot of
Third, with moderate cell congestion, low paging activities mean low loss.
Fourth, high paging activity together with low radio link setup success causes high loss. This may be due to user equipment (UE) losing network connectivity and to low coverage areas, which results in increased UE paging activity by the MME.
Fifth, high cell congestion leads to a high loss.
The various techniques and methods discussed herein may be used to provide cell by cell error analysis, cell grouping error analysis and so on. Moreover, using AdaBoost trees and other boost techniques, improved stability and accuracy may be achieved within the context of the various embodiments.
As depicted in
It will be appreciated that the functions depicted and described herein may be implemented in software and/or in a combination of software and hardware, e.g., using a general purpose computer, one or more application specific integrated circuits (ASIC), and/or any other hardware equivalents. In one embodiment, the cooperating process 605 can be loaded into memory 604 and executed by processor 603 to implement the functions as discussed herein. Thus, cooperating process 605 (including associated data structures) can be stored on a computer readable storage medium, e.g., RAM memory, magnetic or optical drive or diskette, and the like.
It will be appreciated that computer 600 depicted in
It is contemplated that some of the steps discussed herein as software methods may be implemented within hardware, for example, as circuitry that cooperates with the processor to perform various method steps. Portions of the functions/elements described herein may be implemented as a computer program product wherein computer instructions, when processed by a computer, adapt the operation of the computer such that the methods and/or techniques described herein are invoked or otherwise provided. Instructions for invoking the inventive methods may be stored in tangible and non-transitory computer readable medium such as fixed or removable media or memory, transmitted via a tangible or intangible data stream in a broadcast or other signal bearing medium, and/or stored within a memory within a computing device operating according to the instructions.
While the foregoing is directed to various embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. As such, the appropriate scope of the invention is to be determined according to the claims, which follow.