The present invention relates to a method and system for adapting the routing of data traffic through a data network in dependence on one or more predicted events within the network.
The rapid growth of the Internet and demand for broadband connectivity has led several large operators to deploy their own Internet Protocol (IP) networks. These are used to support bandwidth-hungry multimedia applications. Despite being engineered to cope with the demands of these applications, some of which are inelastic and require a guaranteed Quality of Service (QoS), unplanned failures occur at various protocol layers, exhibiting themselves as loss of IP connectivity between network components.
When such failures occur, the network is reconfigured in real time by a process consisting of the following steps:
This invention addresses this challenge by proposing in one embodiment an intelligent network management system that predicts a link failure event and commences the time-consuming network reconfiguration process before the failure actually occurs. In one embodiment any packets that have already been directed at the deteriorating link can still be handled by the link because it is still functional, without new packets being addressed to it. In such an embodiment the link load is also eased by this process to allow possible recovery, especially if traffic overload is the cause of failure. As a consequence, given adequate prior notice, packet loss due to link failures can be eliminated allowing IP networks to provide the level of QoS offered by their telephony counterparts and demanded by several applications and third-party CPs.
In one embodiment predictive rules of the form ‘L1←L2’ are derived, which have the meaning that “if a cluster of links, collectively referred to as L2, fails, link cluster L1 will fail within the next α seconds”. In the embodiment each rule is provided with a degree of confidence β, so that when failure of L2 is logged in real-time, the system automatically triggers the reconfiguration process. In one embodiment routing parameters (such as, for example, tables and/or link costs) are then updated to route packets away from cluster L1 before it fails. In this embodiment L2 and L1 are known as the antecedent and consequent of the rule respectively, and each may comprise any number of individual links (i.e. a link cluster or an item set) that fail simultaneously or within a small margin, γ, of each other. The value of α is easily adjustable and pre-programmed. It may depend on several factors, including the size of the managed network and the predictive capacity of the process, and it must not be so short that slight delays in receiving ‘hello’ messages from neighbouring routers result in unnecessary recuperative changes being triggered.
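By way of illustration only, a rule of the form ‘L1←L2’ and its confidence β might be represented as in the following sketch. The class and function names are hypothetical, and the confidence is computed using the standard association-rule definition (the fraction of observed L2 failures that were followed by an L1 failure), which is an assumption since the text does not spell out the formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRule:
    antecedent: frozenset  # L2: the cluster observed to fail first
    consequent: frozenset  # L1: the cluster predicted to fail within alpha_s
    alpha_s: float         # prediction window alpha, in seconds
    confidence: float      # beta


def rule_confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain
    the consequent. Each transaction is a pair (failed_first, failed_later)."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    matching = [later for first, later in transactions
                if antecedent <= frozenset(first)]
    if not matching:
        return 0.0
    hits = sum(1 for later in matching if consequent <= frozenset(later))
    return hits / len(matching)


def should_trigger(rule, failed_links, min_confidence):
    """True when every antecedent link has failed and beta is high enough."""
    return (rule.confidence >= min_confidence
            and rule.antecedent <= frozenset(failed_links))
```

In this sketch a rule fires only when every link of the antecedent cluster has been logged as failed; a relaxed policy that accepts a partial cluster would be an equally plausible choice.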
In other embodiments, predictive failure rules are derived based on other events, other than consequent link failures following observed antecedent link failures per se. For example, in one embodiment a failure rule based on any consequent link failures following the generation of a particular type of network message, or following the start-up of a particular type of network client application may be formed. In other embodiments time based failure rules can be formulated, for example if it is observed that a particular link or cluster fails at a set time or within a regularly definable time period (for example, if maintenance is scheduled regularly). That is, failure rules can be derived that relate consequent link or node failures to antecedent events of several different types, preferably observed via network log messages.
In view of the above, from one aspect there is provided a method of routing data in a data network, comprising:
In one embodiment one or more of the failure rules relate an observed consequent failure of a network link to an observed antecedent failure of a network link, the observations being historical network observations.
In one embodiment the change that is made to the routing of data is to route data away or around the network component that is predicted to fail, such that the load on the network component is removed. In another embodiment the change that is made is to reduce the amount of data that is routed through the component, such that the load is reduced, but not removed entirely. In another embodiment the change that is made to routing in the network may be to change the network admission control conditions, so as to reduce the amount of data entering the network, such that the whole network load reduces.
In one embodiment a network component may comprise a network link, or a network router, or part of such which is not necessarily a complete device, but could be just a port on a router or a logical link (several logical flows over the same piece of wire).
In one embodiment one or more of the failure rules relate an observed consequent failure of a network component to one or more observed antecedent network events, the antecedent network events being one or more from the group comprising: a network client application starting up; present time and/or date; and/or a network log message being generated.
One embodiment further comprises storing failure rule confidence values, wherein a failure rule is used to predict a network failure in dependence on its confidence value.
One embodiment further comprises validating a stored failure rule against network observation data obtained after the rule has been implemented so as to determine if said rule should still be applied.
In one embodiment the failure rules are generated by:
In one embodiment the failure rules are repeatedly updated in dependence on the observed network operation.
In one embodiment a hierarchy of failure rules is provided, comprising a plurality of sets of failure rules, each set relating to a part of the network, a set of failure rules belonging to a lower set in the hierarchy relating to a smaller part of the network than a set of rules belonging to a higher set in the hierarchy. In this embodiment each set of failure rules is associated with a network management server responsible for a part of the network, a hierarchy of network management servers being provided corresponding to the hierarchy of the sets of failure rules, wherein a failure rule relating to network components in two or more parts of the network is administered by a network management server at a level in the hierarchy that encompasses the parts concerned.
Another aspect of the invention provides a computer program or suite of computer programs so arranged that when executed by a computer it/they cause the computer to operate according to any of the preceding embodiments. In addition, another aspect of the invention provides a computer readable medium storing the computer program or at least one of the suite of computer programs.
A further aspect of the invention provides a system for routing data in a data network, comprising:
Another aspect of the invention provides a system, comprising:
Another aspect of the invention provides an apparatus, comprising: at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
Further features and advantages of embodiments of the present invention will become apparent from the following description thereof, presented by way of example only, and by reference to the accompanying drawings, wherein like reference numerals refer to like parts, and wherein:—
An overview of the operation of embodiments of the invention will now be described with respect to
Embodiments of the invention provide a system that predicts network link failures and creates a change in the network before the failure actually happens by instigating policy-based adjustment of routing parameters. In particular, an embodiment of the invention operates in two phases. In the first phase the historical operation of a network is observed (B.4.2), to determine observed relationships between link or cluster failures that have occurred, and subsequent failures of different links or clusters. From these observed relationships failure rules can be derived (B.4.4) that are then applied to control routing in the network during a second, control, phase. That is, in the second, control, phase, the derived failure rules are applied such that if a link or cluster failure occurs, then from the rules prior knowledge of what additional links may fail in the next α time period is obtained, and remedial action can then be taken so as to route data traffic away from the links that are predicted to fail (B.4.6). One advantage of such a system is that the prior knowledge of failure allows pre-emptive action such that new packets are routed away from the failing link while it still has the capacity to handle packets that have already been addressed to it. Another advantage of this is that the link load is eased, allowing the failing link to recover if traffic congestion is the cause of failure. With such an arrangement, one can maintain a desired Quality of Service in the network by triggering the response to a failure and achieving network stability before a predicted failure even happens.
In addition, in order to allow for network evolution to occur, once created the failure rules are subject to repeated validation to ensure that they are still valid, and apply to the network in its present state (B.4.8).
A more detailed embodiment will now be described with respect to the drawings.
Within the detailed embodiment to be described, the operation of the embodiment falls into several different categories. Firstly, there is a discussion of how the failure rules themselves are initially derived from historical observation data, and how further rules may be repeatedly derived during operation of the network under the control of the embodiment. Following this, it is also necessary to consider how the failure rules evolve, and whether their validity should be maintained as time progresses. In addition, consideration is also given to where in the network individual control elements may be located, and in particular regarding the hierarchy of control elements, and how failure rules and remedial action can cross catchment areas. Following this, further details of the hardware architecture that is employed in the embodiment, and the functional operation thereof, are discussed.
As noted above, the first phase of operation in the presently described embodiment is to derive failure rules, that can then be used to control routing in the network subsequently. Periodically-updated historical data from network management agents is typically held in large repositories. Such raw data could consist of real-time alarms or ‘traps’ sent from the agents to its allocated network manager or to a listener capturing such data for monitoring purposes. The first step in the first phase is to therefore identify useful information such as keywords that denote a link failure that is being reported. This can be done either from prior knowledge or by detailed searches of the repository. Due to variations in reporting mechanisms and protocols, it is likely that several keywords will denote a link failure. These identified keywords are then used to parse incoming real-time logs. Once identified, the rows that contain these keywords are filtered out and with more data processing, a transaction database such as that shown in
With reference to
Having obtained the identified observed failure relationships, it is then necessary in the present embodiment to consolidate such similarities into rules. This is performed as follows.
The created transaction database (
As will be described in more detail later, in the second phase of operation these derived rules, of the form ‘L1←L2’, are stored in an apparatus according to the present embodiment when the system is first run. The incoming logs are then monitored in real time and when a failure of the antecedent of any of the stored rules is spotted (i.e. L2 in the above rule), the network reconfiguration process is commenced straight away and the network restores itself assuming L1 is about to fail before the failure occurs. This communication between the monitoring application in the control layer network manager and the lower layer managed agents who implement the routing reconfiguration is described further later.
Note that the parameters used for prediction are not confined to using one link cluster failure to predict another. In other embodiments Pearson correlation or neural network techniques, such as the multilayer perceptron, can be used to identify other parameters in the available historical data that could possibly cause and/or indicate a link failure. Embodiments of the invention require only information from status logs which are then used to predict failures in network links. The information can be failure messages or any other message, e.g. messages generated by applications. One of the main aspects of embodiments of the invention is that of how to make use of predictive information coming from a control intelligence in the network so as to change routing data. One embodiment of this invention predicts one or more network link cluster failures occurring α seconds after another link cluster failure. However, it is easy to understand that in other embodiments of the invention a number of other events can be predicted from a range of data logs, and that failure rules can then subsequently be derived and used to predict failure and control traffic routing. Examples of such failure rules include but are not limited to the following types:
Type 1: If message sequence abc appears, link failure messages for links d, e and f appear within α seconds.
Type 2: If application a is initiated/altered, link b fails within α seconds.
Type 3: Link a fails every day of the week at 10:00 AM.
Type 4: Link a fails within α seconds if the parameters x, y and z report values of x1, y1 and z1 respectively.
Type 5: If event E occurs then L1 will fail (E could also be any combination of events).
Type 6: If x>t then L1 will fail (where x is a value derived from some event and t is a threshold).
Type 7: Instead of x>t we could have an arbitrary logical combination of values tested against thresholds, e.g. x>t1 and y>t2 or z>t3 etc
Type 8: If M(X)>t then L1 will fail, where M is a numerical model (system of equations, a neural network etc) and X is a vector of values derived from observed events (counts, averages, network parameters etc).
Some or all of these different types of failure rule may be included in one or more embodiments of the invention. Note also that embodiments of the invention do not require a particular process to be used for prediction. While in the detailed embodiment being described association rule mining is used, it is also possible in other embodiments to make these predictions using a variety of other data analysis techniques, for example Bayesian Networks, Hidden Markov Models etc.
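Purely as an illustrative sketch, the threshold-based rule types listed above (Types 6 to 8) can be expressed as composable predicates over a dictionary of observed values. The helper names, the observation keys and the numeric thresholds below are all hypothetical and are not taken from the described embodiment.

```python
from typing import Callable, Mapping

Predicate = Callable[[Mapping[str, float]], bool]


def threshold(name: str, t: float) -> Predicate:
    """Type 6: true when the observed value `name` exceeds threshold t.
    A missing value is treated as not exceeding the threshold."""
    return lambda obs: obs.get(name, float("-inf")) > t


def all_of(*preds: Predicate) -> Predicate:
    return lambda obs: all(p(obs) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    return lambda obs: any(p(obs) for p in preds)


def model_rule(model: Callable[[Mapping[str, float]], float], t: float) -> Predicate:
    """Type 8: true when the numerical model output M(X) exceeds t."""
    return lambda obs: model(obs) > t


# Type 7: (x > t1 and y > t2) or z > t3, with illustrative thresholds.
type7 = any_of(all_of(threshold("x", 1.0), threshold("y", 2.0)),
               threshold("z", 3.0))

# Type 8 with a toy linear model standing in for M.
type8 = model_rule(lambda obs: 0.5 * obs["x"] + 0.5 * obs["y"], 1.0)
```

Each predicate would be paired with a consequent (for example, link cluster L1) and a window α in the same way as the link-failure rules of the detailed embodiment.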
Before going on to describe how the derived failure rules can be used to control data traffic routing, it will be useful to take a look at another aspect of any derived rule, being that of the length of time for which it is valid, given the likely evolution of the network. In this respect, aside from using the derived rules to better manage network routes in real-time it is also important to periodically evaluate the validity of the rules in parallel. The following process has been devised to accomplish this aspect.
The confidence of a rule indicates its ‘correctness’. If the network has changed such that a rule is less and less applicable, its confidence will decrease gradually over every evaluated time period. This is shown in
More particularly, with reference to
In terms of how a rule is validated over time, there are several options. For example, in one embodiment rules can be evaluated periodically using transaction data that was not considered at the time of the previous evaluation. With such an arrangement a transaction table such as shown in
An alternative to the above is for all rules in the repository to be flushed out at every evaluation and re-computed from scratch. In such an arrangement in another embodiment then again transaction data from network logs is compiled, and completely new rules then re-computed on the basis thereof. The trade-off between these two techniques depends on ease of data management and communication of rules to managing agents, and the computational complexity and resources required for processing all stored data every time.
However, whilst rule validation is important to allow for network evolution, there is a downside to validating rules in this manner. More particularly, if a given rule is correct, when it is implemented, it could cause the predicted failure to no longer occur (especially if traffic overload is the original cause of failure). In such a case, feedback from the network over the next M periods would cause the rule to be dropped because the rule confidence would no longer hold. For example, if the rule is validated against log data obtained from when the rule was first implemented, then the operation of the rule may in fact have prevented the failure that the rule predicts from occurring, and hence the operation of the rule effectively invalidates itself.
In order to get around such problems, in another embodiment the rule evaluation could be done by keeping a repository of recently extinct rules from which a recently decommissioned rule can be reactivated if its rule confidence increases after it has been decommissioned (which therefore requires its confidence to be monitored for N periods even after it is no longer used). However, this may still not be the most efficient way to validate rules as stability is never achieved in the control plane (some good rules may be constantly dropped and then reactivated). In yet another embodiment, therefore, operator knowledge may be used to single out rules that should never be dropped. Still another embodiment could use a strategy that switches off rules with decreasing confidence to test if the rule is still required, and then reactivates the switched off rule immediately if the predicted failure indeed occurs.
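The decommission-and-reactivate strategy, together with the operator override, might be sketched as below. This is an assumption-laden illustration: the names are hypothetical, M and N are taken to be counts of consecutive evaluation periods, and a single minimum-confidence threshold is assumed for both dropping and reactivating a rule.

```python
from dataclasses import dataclass


@dataclass
class TrackedRule:
    name: str
    pinned: bool = False      # operator knowledge: never drop this rule
    active: bool = True
    low_periods: int = 0      # consecutive low-confidence periods while active
    extinct_periods: int = 0  # periods spent in the extinct repository


def evaluate(rule, confidence, min_conf, m_periods, n_periods):
    """Update the rule for one evaluation period.
    Returns 'active', 'extinct' or 'purged'."""
    if rule.active:
        if rule.pinned or confidence >= min_conf:
            rule.low_periods = 0
        else:
            rule.low_periods += 1
            if rule.low_periods >= m_periods:
                # Decommission, but keep monitoring in the extinct repository.
                rule.active = False
                rule.extinct_periods = 0
        return "active" if rule.active else "extinct"
    if confidence >= min_conf:
        # Confidence recovered after decommissioning: reactivate.
        rule.active = True
        rule.low_periods = 0
        return "active"
    rule.extinct_periods += 1
    return "purged" if rule.extinct_periods >= n_periods else "extinct"
```

Because the same threshold governs both transitions, a rule whose confidence hovers around that value would oscillate, which is the control-plane instability noted above.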
With the above, therefore, within embodiments of the invention some form of rule validation is required to take place, to account for network evolution. However, how that validation occurs can take many different forms, and any of the above described techniques can be used in embodiments of the invention.
Thus far we have discussed how rules are derived, and validated. In this section we introduce the hierarchical arrangement of hardware and software agents employed in an embodiment of the invention to permit management of rule generation, and employment of the rules to control routing.
The prediction process used in embodiments of the invention relies on retrieving network information from a large catchment area—in fact, the larger the area, for a given granularity of network data per unit area, the more likely it is that an association that exists can be identified by the rule miner and a larger number of link failures can be predicted. Thus, the rule miner is assumed to operate on a central management platform 32, as shown in
Each instance of the network management application running on each management platform (described further later) can operate in one of two ways. Either it transfers all its rules to all its managed agents, and each agent then retains only the rules that are relevant to itself; or the manager performs this filtering itself, sending each agent only the rules it needs and handling cross-catchment area rules on its own. Thus, for example, management platform 32 may propagate down to platform 34 those rules that apply to network nodes and links within network area 342, whilst retaining control of those rules applying to the other parts of network area 322 other than part 342. Similarly, management platform 34 may propagate downwards rules relating to network areas 362 and 364, managed by management platform 36, and area 382, managed by management platform 38.
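As a hedged sketch of manager-side filtering, a rule can be assigned to the lowest management platform whose catchment area covers every link the rule mentions. The parent map below mirrors the platform numbering 32, 34, 36 and 38 used in the description, but the mapping of individual links to platforms and the function names are hypothetical.

```python
# Child platform -> parent platform; platform 32 is the root.
PARENT = {"36": "34", "38": "34", "34": "32", "32": None}


def ancestors(platform):
    """The platform followed by each platform above it, lowest first."""
    chain = []
    while platform is not None:
        chain.append(platform)
        platform = PARENT[platform]
    return chain


def administering_platform(rule_links, link_owner):
    """Lowest platform whose catchment area contains every link in the rule.
    `link_owner` maps a link to the lowest platform that manages it."""
    chains = [ancestors(link_owner[link]) for link in rule_links]
    common = set(chains[0]).intersection(*chains[1:])
    return next(p for p in chains[0] if p in common)
```

A rule confined to one catchment area is thus pushed down to its local manager, while a cross-catchment rule stays with the lowest manager that spans all the areas concerned.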
The above described hierarchy is carried through to the smallest network granularity possible, subject to the constraint that the catchment area must not be made so small that the interactions within it are too infrequent and confined for prediction to be possible. Whether or not the rule miners in the hierarchy collaborate with each other and communicate when an event is likely to trigger cross-catchment area failures depends on network policy, traffic overheads generated by the required information sharing and the capabilities of the implemented NMS protocol. This is an important decision when there are multiple management applications handling parts of the network that are in the same level in the hierarchy (such as areas 362, 364, and 382 in FIG. 3), i.e. in a distributed NMS framework. Should cross-catchment area manager-to-manager communication be enabled, MPLS or a similar mechanism may be used to ensure that such management information predicting a failure is propagated to the relevant manager in sufficient time.
Note that in the present embodiment rules themselves need not be communicated from intelligence nodes to network nodes; only the actions that the routers must take to respond to a given predicted failure need be sent to them. Rules will have to be communicated between instances of the same intelligent software i.e. the managing agents running on the management platforms; especially if they have the same power in the network (e.g. rules that apply across catchment areas will be communicated from the higher level server to the local managers of the relevant catchment areas).
In terms of how rules are propagated down the hierarchy, and how management actions are passed on to routers in the network, in one embodiment the basic Simple Network Management Protocol (SNMP) is used as an example. That is, SNMP is only an example of a protocol that can be used to communicate decisions. Even a custom-built protocol for this purpose could be used. The actual protocol used is only the carrier for the system to transport its instructions to routers and embodiments of the invention include all variations of protocols that can be used for this purpose.
In the case of SNMPv1, since it does not support imperative commands from manager to agent, the managing application server in the management platform functions with the SNMP engine to generate Set requests as explained in Stallings, W., “SNMP, SNMPv2, SNMPv3, and RMON1 and RMON2”, 3rd ed., Addison-Wesley, 1999. Note that the management application must know which Management Information Base (MIB) parameters it wishes to change in the agent's MIB in order to trigger a change before its ‘real’ cause has happened in real life and this knowledge can be gained from observing a ‘real’ scenario and triggering the same change in the predicted scenario. However, an exception occurs when a link is predicted to fail but does not because the pre-emptive action has eased its load, allowing it to recover. In this case, an override must be devised where the managed agent could actively communicate this recovery if the link is still alive after α seconds of prediction so that the rule mining intelligence can use this information in future predictions. The intelligent managers can also decide which routers within their catchment area must be affected in responding to a predicted failure, so that only specific routers respond to a given predicted failure. As an extension to this feature, it is possible to extract this node-choosing feature as a separate generic entity that decides, given an input set of priorities and a current model of the network, which nodes should proactively change their routing data in order to respond to a predicted failure. (In this example, such a generic system could be instructed (or dynamically gain the knowledge) that an important factor in choosing nodes is the value of the traffic that is about to be transmitted from that node which would be lost if the predicted link failure occurred. Another important priority could be the current occupancy of the router buffers.)
Note also that encryption or some form of security/initial authentication should be devised to prevent hackers placing message generators that flood the routers with Set requests. Even if the Set requests are not fulfilled because these senders are not authenticated to access the MIBs, processing of these messages will occupy router capacity and could result in Denial of Service attacks.
Additionally, in one embodiment a GUI is also proposed to allow a near-real-time visualisation of the failure model created from the predictions—the thickness of a line between two network components is proportional to the probability of failure of one, assuming the other has already failed. This front-end allows an operator to monitor and/or manually intercept the process if necessary.
Each server can have its own rule database (52, 542, 562, 582) or access a shared repository, where currently valid rules are stored. Information flows both ways in these connections.
Each of the servers 54, 56, and 58 controls a particular part of the network, comprising link routers 546, 548, 566, 568, 588, and 586. The routers marked in bold in the network plane (546, 566, 586) are the intelligently chosen routers by the servers. These are the routers that the servers will contact to create changes in routing. The routers that are not in bold (548, 568, 588) are nevertheless a part of the network.
Whilst
The arrows that flow out from the cloud into the data warehouses 60 and 62 on the far right correspond to B 6.24 in
Firstly, it is assumed that there is available a set of historical network operational log data, from which initial rules can be derived. As discussed above, during the operation of embodiments of the invention, this log data is updated with events that occur due to the continuing operation of the network, as shown at B 6.2.
At B.6.4 the user analyses historical data to identify key words that indicate an antecedent or consequent event (failures of link clusters L2 and L1 respectively in this case). Alternatively, these key words can be obtained by the rule discovery process from a previously defined dictionary or thesaurus. Example key words include: “ISIS link 1 failure”, “BGP link 1/0 down/up” etc. Link IDs ‘1/0’ and ‘1’ can be used along with their IP addresses to create a unique identifier for that link.
At B.6.6, once a list of these key words has been formed, these are loaded into the application, which then begins to parse all historic data for these events. The application creates a transaction database as shown in FIG. 1: a table of all link clusters, L1 and L2, that fail within α seconds of each other.
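The keyword filtering and transaction-building steps might look like the following sketch. The log row layout (timestamp, link identifier, message), the keyword list and the function names are assumptions made for illustration; real logs would need keywords matched to their own reporting format.

```python
# Hypothetical keywords; actual logs may use several different phrasings.
FAILURE_KEYWORDS = ("link failure", "link down")


def failure_events(log_rows):
    """Keep (timestamp, link_id) for rows whose message reports a failure."""
    return sorted((ts, link) for ts, link, msg in log_rows
                  if any(k in msg.lower() for k in FAILURE_KEYWORDS))


def cluster_events(events, gamma):
    """Group failures separated by at most gamma seconds into one cluster."""
    clusters = []
    for ts, link in events:
        if clusters and ts - clusters[-1]["end"] <= gamma:
            clusters[-1]["links"].add(link)
            clusters[-1]["end"] = ts
        else:
            clusters.append({"start": ts, "end": ts, "links": {link}})
    return clusters


def build_transactions(clusters, alpha):
    """One row per cluster: (cluster, links failing within alpha afterwards)."""
    rows = []
    for i, first in enumerate(clusters):
        later = set()
        for nxt in clusters[i + 1:]:
            if nxt["start"] - first["end"] <= alpha:
                later |= nxt["links"]
        rows.append((frozenset(first["links"]), frozenset(later)))
    return rows
```

Each resulting row pairs a candidate antecedent cluster with the links observed to fail within α seconds of it, which is the form of input an association rule miner would consume.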
This transaction database is then mined at B.6.8 using an association rule miner to create rules of the form ‘L1←L2’ (“if link cluster L2 fails, then link cluster L1 fails within α seconds”) with a confidence β, as discussed previously. These rules are preferably filtered for a minimum confidence (chosen by the operator depending on the number of ‘false alarms’ that can be tolerated if a prediction is not successful) and the resulting list is stored in respective repositories 52, 542, 562, and 582 at B.6.10, to be periodically validated, at B.6.12, in the manner described previously. In this respect, as shown in
The routers in the network plane produce live data about failures which are uploaded to the repository where the network management application obtains its data. Assuming the latency between event occurrence and detection by the network management application does not exceed α, the management application running on each server can monitor incoming logs for the network part for which it is responsible in real-time/near real-time at B.6.14 and if failure of all or a significant part of cluster L2 is detected at B.6.16 within a small time margin (with allowances for various delays in the network), the network reconfiguration process is triggered as follows.
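One possible reading of "all or a significant part of cluster L2 ... within a small time margin" is sketched below. The function names, the `min_fraction` parameter used to quantify "a significant part", and the representation of rules as (antecedent, consequent, confidence) tuples are all assumptions; antecedent clusters are assumed to be non-empty.

```python
def antecedent_fraction(antecedent, last_failure, now, margin_s):
    """Fraction of antecedent links whose most recent logged failure lies
    within margin_s seconds before `now`."""
    recent = [link for link in antecedent
              if link in last_failure
              and 0 <= now - last_failure[link] <= margin_s]
    return len(recent) / len(antecedent)


def predicted_failures(rules, last_failure, now, margin_s,
                       min_fraction=1.0, min_confidence=0.0):
    """Union of the consequents of every rule whose antecedent has fired."""
    predicted = set()
    for antecedent, consequent, confidence in rules:
        if confidence < min_confidence:
            continue
        fraction = antecedent_fraction(antecedent, last_failure, now, margin_s)
        if fraction >= min_fraction:
            predicted |= set(consequent)
    return predicted
```

With `min_fraction=1.0` the whole antecedent cluster must have failed; lowering it trades earlier warnings against more false alarms.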
It is assumed that the network management applications running on the management platform servers 51, 54, 56, 58 have sufficient privileges to read and alter all required parameters in the routers' Management Information Base (MIB). The applications ‘know’ which parameters in the MIB to alter (either determined dynamically using intelligence or pre-programmed) to force a link to be avoided in routes mapped by the routers for incoming packets (one method of doing this is to change the ‘cost’ of a link to a high value; this is used as an example henceforth). Once failure of cluster L1 is predicted as above, at B.6.18 a management application determines which routers to contact to adjust the corresponding link cost. This can either be pre-set (e.g. “only all central nodes in the autonomous system will make adjustments”) or determined intelligently, for instance, based on current traffic flowing through the affected link cluster L1. This decision can be made again from accessing the MIB. For example, if one of the links in cluster L1 (i.e. the cluster that is predicted to fail) connects two routers, one of which has a buffer that's nearly full with traffic of high priority, this router could be chosen to adjust its link cost in a way that minimises loss of this important traffic. Variations in implementing this have been described earlier.
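A minimal sketch of the "intelligent" router choice follows, assuming that buffer occupancy and the share of high-priority traffic are available per router as values between 0 and 1. The scoring formula (their product), the field names and the function name are illustrative assumptions only and are not prescribed by the embodiment.

```python
def choose_routers(routers, predicted_links, top_k=1):
    """Rank the routers at either end of the predicted-to-fail links.

    routers: name -> {"buffer_occupancy": 0..1, "high_priority_share": 0..1}
    predicted_links: iterable of (router_a, router_b) pairs in cluster L1
    """
    endpoints = {name for link in predicted_links for name in link}

    def score(name):
        stats = routers[name]
        return stats["buffer_occupancy"] * stats["high_priority_share"]

    # Highest score first; ties are broken by name so the result is stable.
    return sorted(endpoints, key=lambda n: (-score(n), n))[:top_k]
```

A pre-set policy, such as always contacting a fixed list of central nodes, would simply bypass this ranking.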
Once the routers to be contacted have been decided upon, the application works with the SNMP engine at each server to generate Set requests to change the relevant MIB parameters (link cost, in this example). These SNMP messages are sent out, with a priority tag depending on how small α is. The MIB parameters are therefore altered as desired and this has the effect that the routing tables are changed so that the cluster L1 is assumed to have failed while it is still functional. L1 can continue handling packets that have already been addressed to it but not receive any more. Additionally, it is important for the application to ensure that the advantage gained by the prediction is not lost by routing elements returning to prior status quo if a ‘hello’ message is received from failing cluster L1 in the time α (i.e. between the prediction response taking effect and the actual failure of L1). One way of doing this is to instruct adjacent routing elements to L1 to ignore these ‘hello’ messages from the failing L1 cluster for a period α′ (where α′≧α).
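The effect of raising a link cost on route selection can be illustrated with a small shortest-path computation. This sketch models only the outcome (routes avoiding the high-cost link) and not the SNMP exchange itself; the topology, the `HIGH_COST` value and the function names are hypothetical.

```python
import heapq

HIGH_COST = 10**6  # illustrative "avoid this link" cost


def shortest_path(costs, src, dst):
    """Dijkstra over undirected links; costs maps (u, v) -> link cost."""
    adj = {}
    for (u, v), c in costs.items():
        adj.setdefault(u, []).append((v, c))
        adj.setdefault(v, []).append((u, c))
    best, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, c in adj.get(u, []):
            nd = d + c
            if nd < best.get(v, float("inf")):
                best[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def avoid_links(costs, predicted):
    """Copy of costs with every predicted-to-fail link set to HIGH_COST."""
    return {link: (HIGH_COST if link in predicted else c)
            for link, c in costs.items()}
```

Because the link is made expensive rather than removed, it remains usable as a last resort and by packets already in flight, consistent with L1 continuing to handle traffic already addressed to it.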
There could potentially be several ways of effecting the desired change, one of which is outlined above. This process has assumed that the network manager cannot issue direct commands to its agents and can only produce a change indirectly by altering the MIB itself.
If the failure of L1 occurs, at B.6.20 this data log is transmitted back to the historical data repository at B.6.24 and is used to monitor the ‘correctness’ of implemented predictive rules over time. If the failure of L1 does not occur as predicted, elements can receive or poll for, process and take action as normal to any ‘hello’ messages received after this period of α′, as shown at B.6.22.
Various further modifications, by way of addition, substitution, or deletion, may be made to either of the above described embodiments to provide further embodiments, any and all of which are intended to be encompassed by the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
10250540.1 | Mar 2010 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/GB2011/000396 | 3/21/2011 | WO | 00 | 9/21/2012 |