The present invention relates generally to determining a network pathway based on parameters of the pathway measured in more than one direction. The present invention relates more specifically to methods for transferring a communication signal between two end points from a first network pathway to a second network pathway when an upcoming or imminent failure is detected by either of the two end points on the first network pathway.
Failover is the capability to switch over automatically to a redundant or standby network upon the failure or abnormal termination of a previously active network. There is a general uncertainty surrounding failover in a converged environment, where convergence refers to the combining of real-time applications such as voice and video with data applications over a unified connection, or access method. The uncertainty is the survivability of both the real-time and data applications during primary network failure conditions.
As real-time voice, video and data services converge, businesses need continuity to support their networking needs. Networks must fail over to backup connections without losing any data or dropping live VoIP calls, for example.
Some current failover technologies do support the survivability of data applications during network failover provided that the application's timeout value exceeds the failover time. However, the survivability under failover conditions of real-time applications such as live VoIP calls is not presently supported due to lengthy timing issues and the nature of VoIP being comprised of UDP/connectionless traffic states. The connectionless nature of the application does not provide any error checking or retransmission to maintain the application state during the lengthy failover condition. Furthermore, unstable lower link connections may further impede network failover and compound the loss of data under failover conditions.
Existing Internet access methods combine the use of multiple access methods to deliver a unified network connection for the transport of VoIP calls, video over IP, and burst type data applications. However, in the event of a lower link network failure, they currently cannot support the survivability of real-time applications such as live VoIP calls during a failover condition.
The prior art is mainly in the area of link failover, especially in the TCP/IP area (e.g. Link2Web protocols), and does not provide any solution for separating false positive from true positive failure detections.
Some prior art is directed to network failover solutions for TCP/IP (referred to above), which is essentially based on timers. For example, timers are typically used to define an interval for one peer signalling to another peer to verify that the communication session between them has not ceased. The signal may be a ping message or other keepalive or heartbeat message. If there is no response from the second peer to the first peer, then the first peer assumes that the communication session has ceased and therefore a new pathway is selected, i.e. the communication is sent to another network component. Clearly, this technique requires the primary pathway to have already failed before a switch to a secondary pathway can occur. Due to the restrictions defined by the timers, in certain circumstances the pathway sometimes cannot be changed quickly enough to avoid a connection loss noticeable by client devices at the network end points.
The prior art also appears to address the bi-directional nature of voice communication for example by synchronizing the timers, and then synchronizing transfer of the communication to another pathway. These techniques also require that the primary pathway fail before the secondary pathway is used.
There are numerous further disadvantages to the prior art solutions described. First, they depend on the distance between the two peers, such that the intervals and related timer-based parameters may need to be adjusted for an optimal solution. Second, depending on the various parameters, the determination that there has been a failure along the pathway often does not happen quickly enough, and therefore movement to a new pathway does not occur quickly enough, which results, for example in the voice area, in a “dropped call”. This could be counteracted by increasing the frequency of sending the signalling message, but doing so in turn increases the resource requirements of the network pathway. False positive error detection is also a problem with this technology.
One such technology is disclosed in U.S. Pat. No. 7,269,157 to Klinker (“Klinker”). Klinker, for example, is focused on connectivity verification and/or traffic analysis. The prior art approaches have a number of disadvantages, including most significantly, (1) significant false positives (particularly if there is significant network congestion), and (2) a higher likelihood of dropped calls in the VoIP context because there is no solution for providing adequate control of the remote network component, such that the connection may terminate prior to deactivation of the remote network component.
Furthermore, Klinker is specific to Border Gateway Protocol (“BGP”), and in fact in many ways is an enhancement of certain aspects of BGP. If one removes BGP (being prior art) from Klinker, what is left is essentially a device that is operable to check if another end point is alive, which is essentially taught by U.S. Pat. No. 6,078,957 to Adelmann. Klinker observes packet flow based on type of traffic (i.e. HTTP, voice etc.) between two components in a network. Klinker lacks contemplation of (1) an overall architecture that enables control of transportation at all peers, and (2) collection of enough information to support selection of a new path in time before communication failure, which is essential to failover to support real-time applications. Klinker is clearly limited to analyzing traffic flow.
There is also no suggestion in the prior art for a network layer implementation that provides advantageous failover capability. Prior art, particularly prior art referring to ICMP, generally uses higher network layer mechanisms. The higher network layer mechanisms are generally more dependent on traffic, and therefore in the presence of congestion false positives may result (i.e. identification of increased traffic as performance degradation), which accordingly results in reliance on erroneous information for failover purposes.
Load balancing is another field that provides a solution to some of these issues, i.e. determining pathways based on performance or cost, but load balancing is not bi-directional and therefore does not address the problems identified above.
In these applications, a drop in bandwidth, and not absolute failure, along the primary link should dictate a switch to the secondary link. The prior art solutions do not take bandwidth into consideration.
Therefore, what is required is a failover technology that distinguishes false from true positive detections and that can perform a failover in a short response time so as to support real-time applications. Furthermore, the technology should be able to ensure that all peers become aware of the failover within such time as to prevent a lost connection.
The present invention provides a computer-network-implementable method for failover of a failing network connection between at least two network points comprising: identifying and establishing, by a network control means at each network point, at least a primary connection between the at least two network points; gathering intelligence, by operation of a network information means at each network point, such intelligence relating to at least one performance parameter of each primary connection; establishing a threshold condition for the primary connection; and accessing a decision tree at one or more of the network points when the threshold condition is met for determining whether to avoid the primary connection, wherein if the decision tree determines to avoid the primary connection, the primary connection is pre-emptively avoided, a secondary connection is created between the at least two network points by the network control means, and data is communicated over the secondary connection between the at least two network points.
The present invention also provides a system for failover of a failing network connection between at least two network points comprising: a plurality of network control means, each network control means linked to one of the at least two network points, each network control means identifying and establishing at least a primary connection between the at least two network points; a plurality of network information means, each network information means linked to one of the at least two network points, the network information means gathering intelligence relating to at least one performance parameter of each primary connection; a decision tree at one or more of the network points, wherein the decision tree is accessed when a threshold condition for the primary connection is met, the decision tree determining whether to avoid the primary connection, wherein if the decision tree determines to avoid the primary connection, the primary connection is pre-emptively avoided, a secondary connection is created between the at least two network points by the network control means, and data is communicated over the secondary connection between the at least two network points.
In this respect, before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
The present invention provides a multidirectional pathway selection technique that overcomes the limitations of the prior art.
The present invention is directed to managing a network connection pathway for supporting communication between the two or more end points at any given time. These communications may occur in two directions, therefore the connection can be initialized and managed by network devices at either or both of the two end points, such that the invention enables bidirectional connection management. One implementation of the invention is a multipath redundancy protocol.
The present invention is also directed to providing a solution for avoiding point to point or point to multipoint communication failures including but not limited to peer to peer, one to many or many to one communication failures. One specific embodiment of the invention is a point to point or point to multipoint network communication fast failover utility that enables definition of secondary communication connections before communication failure occurs, wherein the utility gathers intelligence by monitoring one or more communication parameters and a secondary connection is initialized and utilized when one or more of the parameters corresponding to a primary connection fall below a predetermined threshold. Correspondingly, the present invention provides for the management of the connection on a multidirectional basis to enable fast failover. This enables the selection of a secondary connection and the transfer of resources and communication to the secondary connection, prior to the primary connection fully failing. This technique provides a solution that is operable where particular client devices at the end points defining the connection are not aware that the primary pathway has failed or been lost. In an implementation of the present invention wherein the client devices are IP telephony devices, this can prevent experiencing what is known as a “dropped call” due to loss of a network connection.
The present invention may be implemented in a network or network component as logic that enables identification of a connection that is “starting to fail”, or is “degrading below a predetermined threshold”. When this happens, a change in the connection is initiated whereby the failing or degrading connection is avoided. In this case, there could be gaps in a communication session, but the communication session (e.g. a call) will not be dropped, and thus loss of data is minimized.
The present invention is based on a point to point or point to multipoint architecture.
The network information means uses a technique best understood as “communication path propagation”, which determines network conditions relevant to the communication path, for example by delivering a series of pulses to the various points in the communication path, so that communication failure can be identified prior to complete communication failure. The present invention is operable to measure response (or lack of response) to said pulses in order to identify communication failure of a primary connection or path pre-emptively, prior to complete communication failure, so as to gather intelligence for enabling the selection of a secondary connection or path to which the communication can be redirected without communication failure. The network conditions measured by operation of this technique may relate to bandwidth, jitter, saturation, loss and/or latency, and other network conditions such as network costs. The network information means may operate on a network layer such as network layer 4 or higher, corresponding to a layer important for real-time applications such as VoIP. However, it should be noted that the technology could work at lower layers. The network information means described could be implemented in a communication network in a variety of ways. The network information means can be implemented (1) as a client within a network node (e.g. a router, switch, computer, or server, etc.), (2) as a failover system (physical and/or logical, in hardware and/or software), (3) as part of a bonding utility, aggregating utility, or load balancing utility, (4) as part of a phone system, e.g. PBX, telephone, or IP phone, (5) as functionality implemented as part of a network manager, or (6) as part of a communication protocol.
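By way of illustration only, the measurement of responses to pulses may be sketched as follows. This is a minimal sketch under stated assumptions: the names PulseStats and summarize_pulses are hypothetical, an unanswered pulse is represented by None, and jitter is taken as the mean absolute difference between consecutive response times, which is one common definition and not necessarily the one used in any particular embodiment.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class PulseStats:
    """Hypothetical summary of one batch of pulses on a single connection."""
    loss_ratio: float             # fraction of pulses that received no response
    latency_ms: Optional[float]   # mean response time of answered pulses
    jitter_ms: Optional[float]    # mean absolute change between consecutive response times


def summarize_pulses(rtts_ms: Sequence[Optional[float]]) -> PulseStats:
    """Summarize pulse responses; each entry is a response time in ms, or None if unanswered."""
    if not rtts_ms:
        raise ValueError("at least one pulse is required")
    answered = [r for r in rtts_ms if r is not None]
    loss_ratio = 1.0 - len(answered) / len(rtts_ms)
    latency = mean(answered) if answered else None
    if len(answered) >= 2:
        # Assumed jitter definition: average step between successive answered pulses.
        jitter = mean([abs(b - a) for a, b in zip(answered, answered[1:])])
    else:
        jitter = None
    return PulseStats(loss_ratio, latency, jitter)
```

For example, summarize_pulses([10.0, 12.0, None, 11.0]) describes a connection on which one of four pulses went unanswered. Values of this kind are the intelligence that may later be compared against configured thresholds.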
It should be understood that each “point” may be a network node, host, peer or application.
It should further be noted that the present invention teaches a system that may be implemented at each point to enable the fast failover solution described, whereas the prior art may require different components in order to provide bidirectional fast failover.
Initially, a packet distribution technique may be used to establish the primary connection and secondary connection pathways. The present invention may require establishing a subservient (master-slave) relationship between a primary and a secondary connection.
The present invention is directed to managing the network pathway for supporting communication between end points of a point to point or point to multipoint connection at any given time. The present invention is also directed to providing a solution for avoiding point to point or point to multipoint network communication failures. The present invention optimally avoids “false positive” failure detections that commonly occur in many existing techniques. Existing failover techniques are not able to support real-time applications such as VoIP due in part to timing issues, including required timeout values, corresponding to these applications. In order to support these applications, failover must be performed in near real-time. This necessitates responding to failed connections very quickly. One way to achieve this speed is to query the connection state at very short intervals. However, such an algorithm will cause a significant overhead and may cause the connection to become unstable. For example, a link quality report (LQR) packet using the link control protocol (LCP) may be used to poll a point on the opposite side of a connection, but minimization of the LCP response time is likely to cause a connection, especially a lower connection, to become unstable. Thus LCP cannot be used to determine failure on lower connections in a timely fashion to support real-time applications. A similar result is likely to be obtained using compression control protocol (CCP).
The required detection speed is instead achieved by implementing a packet that is out of band relative to all other failover control protocols. The present invention provides a technique to determine lower connection failure that supports disparate connection bonding and therefore may have lower connections that vary in latency and bandwidth. Thus, this technique functions on a connection by connection basis and is adjustable.
A network information means may be provided for obtaining and gathering intelligence including, for example, status, state, and connection information in accordance with these requirements. For example, the network information means may monitor the frequency of verification messages (described below) and can be configured to adjust thresholds for receiving these messages based on connection characteristics. The network information means may also monitor the number of unanswered verification messages and allow for the configuration of a threshold number of lost packets to trigger a failure mode. Additionally, the network information means may be adjustable to account for jitter over the connections. The network information means may also enable the functions described below.
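The per-connection adjustability described above may be illustrated as follows. This is a hedged sketch only: MonitorConfig, config_for_connection, the jitter multiplier and the 10 ms minimum interval are all hypothetical choices introduced for illustration and are not taken from the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorConfig:
    """Hypothetical per-connection monitoring settings."""
    interval_ms: float      # spacing between verification messages
    ack_timeout_ms: float   # wait after which a message counts as unanswered
    loss_threshold: int     # unanswered messages tolerated before a failure mode


def config_for_connection(base_latency_ms: float,
                          jitter_ms: float,
                          loss_threshold: int = 3,
                          jitter_multiplier: float = 2.0,
                          min_interval_ms: float = 10.0) -> MonitorConfig:
    """Derive monitoring settings from the measured characteristics of one connection."""
    if base_latency_ms <= 0 or jitter_ms < 0 or loss_threshold < 1:
        raise ValueError("latency must be positive, jitter non-negative, threshold >= 1")
    # Assumed rule: allow the typical round trip plus a margin for jitter.
    ack_timeout = base_latency_ms + jitter_multiplier * jitter_ms
    # Assumed rule: do not send messages faster than replies are expected back.
    interval = max(ack_timeout, min_interval_ms)
    return MonitorConfig(interval, ack_timeout, loss_threshold)
```

Under these assumptions, a high-latency lower connection receives a longer acknowledgement timeout than a low-latency one, so that bonded connections with disparate characteristics are each judged against their own baseline.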
In one implementation of the invention, the decision tree is implemented in the layer that manages packet distribution, typically for example the top layer.
The present invention can be implemented as follows: (1) information or intelligence is gathered or collected or accessed by a network information means regarding each relevant connection; (2) a structure is provided for a “connection verification message” corresponding to the pulse referred to above, which each point in the network is operable to send to other points; (3) a configurable threshold is set for parameters 37 indicating connection performance degradation, for example loss, bandwidth, jitter, saturation, latency, or other parameters, per connection, based on the transfer 21 of “connection verification messages” by a network control means; (4) once the threshold is met 23, a “decision tree” is initiated which is operable to analyze the parameters to determine 25 whether to trigger a change 27 in the connection pathway, based for example on (a) a further threshold on the connection performance parameters given the configuration, (b) the interval or timing of the connection verification message, (c) a drop in bandwidth, (d) jitter or latency, (e) costs, (f) application specific parameters, and/or (g) traffic specific parameters; (5) the results of the decision tree are applied by the network control means, for example if the parameters are present for creating a secondary connection, then a secondary connection is created by the network control means which may involve proving the secondary connection to the other network points. It may also involve closing 31 the primary connection and initiating a process for the other point closing 33 the primary connection. It should be understood that cost may be taken into consideration in defining the secondary pathway, for example based on time of day. Furthermore, other performance parameters may be used including congestion protocols or applications. If the existing connection pathway has not failed, the packet may be sent 29 over that connection. Otherwise, the packet may be sent 35 over the secondary connection.
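The per-packet flow described above may be sketched as follows, for illustration only. The names Path and choose_path are hypothetical, and the decision tree is represented simply as a callable that returns True when a change of pathway should be triggered. The numerals in the comments refer to the steps identified above.

```python
from enum import Enum
from typing import Callable


class Path(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


def choose_path(threshold_met: bool, decision_tree: Callable[[], bool]) -> Path:
    """Return the connection over which the next packet should be sent.

    The decision tree is consulted only once the threshold condition is met (23).
    It returns True when a change in the connection pathway should be triggered (25, 27).
    """
    if threshold_met and decision_tree():
        return Path.SECONDARY   # packet sent over the secondary connection (35)
    return Path.PRIMARY         # packet sent over the existing connection (29)
```

Keeping the decision tree behind the threshold check reflects the ordering described above: the more detailed analysis is performed only after a configured threshold indicates possible degradation.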
Each point (usually a network component) includes functionality for sending a connection verification message (the packet referred to above) to a remote point using its network control means. In another implementation, it is an intermediary that sends and receives these messages, for example a router or relay that relays the messages to other points that are fast failover participants. The connection verification message provides information to the network information means of each point regarding connection performance parameters, as mentioned above.
This technique combines two failure detection measures to ensure a maximal true positive detection rate and a minimal false positive detection rate. The determination of whether a connection has failed may be made based upon the status of a lower connection. By gathering intelligence regarding the parameters of the connection, an imminent failure of the connection can be detected within a response time supporting real-time application data. As previously mentioned, these parameters may, for example, include loss, bandwidth, jitter, saturation, latency, other performance based parameters, or any combinations thereof. The detected imminent failure can be handled by pre-emptively avoiding the connection.
For example, where performance of the connection is monitored by loss, the ability to transmit messages over a connection and the current bandwidth being transmitted over the connection may be determined.
The true positive detection may be made using the connection verification message. The connection verification message may be considered to be a pulse as described above. The message, once received at the destination point, may be operable to require the destination point to send an acknowledgement signal (ACK) back to the source point. It should be understood that the pulse described above, as implemented in IP, provides this acknowledgment as a matter of course. The response time of the ACK signal can be used by the source point to identify a lower connection failure within milliseconds of a failed connection condition. Such a time frame typically enables control and manipulation of lower connections close enough to real-time so as to support real-time applications. A particular threshold may be set such that by counting un-ACKed messages, a failure mode may be pre-emptively enabled once the threshold is exceeded.
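The counting of un-ACKed messages may be illustrated by the following sketch. It rests on assumptions not stated above: the class name AckTracker is hypothetical, an acknowledgement arriving after the timeout is treated the same as no acknowledgement, and the count is of consecutive un-ACKed messages that resets on any timely ACK. A sliding-window count would be an equally plausible reading.

```python
from typing import Optional


class AckTracker:
    """Hypothetical counter of un-ACKed connection verification messages."""

    def __init__(self, loss_threshold: int, ack_timeout_ms: float) -> None:
        self.loss_threshold = loss_threshold
        self.ack_timeout_ms = ack_timeout_ms
        self.unacked = 0

    def record(self, ack_time_ms: Optional[float]) -> bool:
        """Record one message result; return True if the failure mode should be enabled.

        ack_time_ms is the time taken for the ACK to arrive, or None if no ACK arrived.
        """
        if ack_time_ms is None or ack_time_ms > self.ack_timeout_ms:
            self.unacked += 1
        else:
            # Assumption: a timely ACK clears the count of un-ACKed messages.
            self.unacked = 0
        # The failure mode is enabled once the threshold is exceeded.
        return self.unacked > self.loss_threshold
```

With a threshold of 2 and a timeout of 50 ms, the third consecutive late or missing ACK would enable the failure mode, which is within a few message intervals of the degradation beginning.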
In one implementation, the connection verification messages are very small packets sent out of band of the actual application data. This enables the packets to be transmitted in a network layer rather than the data layers, and therefore transmission rates are likely to be better. Generally this allows the system of the present invention to require less time than in the prior art to discover that a problem may be about to occur. This implementation also enables the solution to be OSI layer independent.
At times of heavy traffic, however, the use of the ACK alone may result in false positive readings. These periods of heavy traffic often occur due to burst data traffic and the asymmetrical nature of broadband connections. Heavy traffic loads over a primary connection generate increased latency and in some cases very high latency on lower connections. One result is that the ACK for the connection verification message may not be received within the threshold time. A trivial solution of desensitizing the detection strength of the present invention would provide a counterproductive result of reducing true positive detections.
Thus an optimal solution is to only utilize the connection verification message at times of relatively normal traffic loads over the primary connection. The decision tree is thus operable to avoid false positive detection by defining rules including implementation of bandwidth thresholds to confirm whether a positive detection is true or false. The thresholds may correspond to maximum bandwidth in and out on the primary connection. Where lower connection bandwidth exceeds the maximum bandwidth threshold for the outbound connection, the connection verification message results may not be engaged since heavy traffic is known to generate false positives. Instead, when the lower connection bandwidth of the outbound connection exceeds the maximum bandwidth threshold, then the lower connection bandwidth of the inbound connection can be analyzed. If the inbound bandwidth also exceeds the corresponding maximum, then a determination may be made that the connection verification message may not accurately reflect the connection status, and the failure mode enabled by the message may be disabled.
However, if the outbound bandwidth exceeds the maximum threshold and is actually causing a failure, the inbound bandwidth may show a very low throughput. Therefore, by checking the inbound bandwidth and comparing it to a maximum bandwidth threshold, a failure mode can be engaged and the lower primary connection can be avoided in the connection. This provides significant improvement towards eliminating false positives. However, this measure alone also may result in some false positive detection in a particular scenario wherein heavy traffic is completing a transaction. In this scenario, the outbound bandwidth may exceed the threshold, resulting in a measurement of the incoming bandwidth. By this time, the data transaction may be completing, causing the bandwidth to gradually (or quickly) lower to less than the maximum incoming bandwidth value, which would cause a false positive failure mode. To guard against this scenario, a minimum threshold can also be applied to the outbound connection.
Thus a failover in accordance with the present invention can be triggered by any of three events: (1) where the number of connection verification messages not acknowledged exceeds a loss threshold and the outbound bandwidth is between the maximum and minimum thresholds; (2) where too many connection verification messages are not acknowledged, the outbound bandwidth exceeds the maximum threshold, and the inbound bandwidth is lower than the minimum threshold; or (3) where too many connection verification messages are not acknowledged and the outbound bandwidth is less than the minimum threshold. Upon any of these failover modes, the connection may be avoided within such a time as to support real time applications.
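The three triggering events enumerated above may be expressed as a single function, sketched here for illustration. The function and parameter names are hypothetical, the bandwidth units are arbitrary but must be consistent, and an outbound bandwidth exactly equal to either threshold is assumed to fall “between” the thresholds.

```python
from typing import Optional


def failover_event(unacked: int,
                   loss_threshold: int,
                   outbound_bw: float,
                   inbound_bw: float,
                   max_outbound: float,
                   min_outbound: float,
                   min_inbound: float) -> Optional[int]:
    """Return 1, 2 or 3 for the triggering event that applies, or None for no failover."""
    if unacked <= loss_threshold:
        # All three events require too many unacknowledged verification messages.
        return None
    if outbound_bw > max_outbound:
        # Heavy outbound traffic: fail over only if inbound throughput has collapsed.
        return 2 if inbound_bw < min_inbound else None
    if outbound_bw < min_outbound:
        return 3
    # Outbound bandwidth lies between the minimum and maximum thresholds.
    return 1
```

For example, with max_outbound=800, min_outbound=50 and min_inbound=50, a connection showing 5 unacknowledged messages against a loss threshold of 3, an outbound bandwidth of 900 and an inbound bandwidth of 400 yields None. Under the rules above, that combination is treated as heavy traffic rather than a failure.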
The occurrence of any of these three events is identified by operation of the network information means, and managed by operation of the network control means. The network control means is best understood as a utility that implements the decision tree, in a pre-emptive and bi-directional manner.
The network control means described can be implemented in a communication network in a variety of ways. It can be implemented (1) as a client within a network node (e.g. a router, switch, computer, or server, etc.), (2) as a failover system (physical and/or logical, in hardware and/or software), (3) as part of a bonding utility, aggregating utility, or load balancing utility, (4) as part of a phone system, e.g. PBX, telephone, or IP phone, (5) as functionality implemented as part of a network manager, or (6) as part of a communication protocol. The network information means and the network control means therefore can be implemented as complementary processes, features or utilities integrated within the communication network resources (1) to (6) described in the preceding sentence.
It should be understood that any other monitored parameter or combination thereof, as identified above may be used in the decision tree, depending on the particular network conditions expected to cause a failure mode.
Once a failure has been detected, the point detecting the failure may notify other points of the primary connection outage by sending a message to the points over all the remaining connections. This can be provided by techniques also used with load balancing, bonding or other network protocols. This notification may further enable the point to avoid the primary connection and to override its own pulses for the connection. This avoidance technique also helps speed up the failover process or pathway selection process. A logging utility may also be provided for recording the failure.
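The notification step may be sketched as follows, for illustration only. The function name notify_outage, the dictionary message format and the send callable are hypothetical. In practice the transport would be whatever load balancing, bonding or other network protocol is in use.

```python
from typing import Callable, Dict, Iterable, List


def notify_outage(failed: str,
                  connections: Iterable[str],
                  send: Callable[[str, Dict[str, str]], None]) -> List[str]:
    """Send an outage notice for `failed` over every remaining connection.

    Returns the names of the connections over which the notice was sent.
    """
    notice = {"type": "outage", "connection": failed}  # assumed message shape
    used: List[str] = []
    for name in connections:
        if name == failed:
            continue  # the failing connection itself is not relied upon
        send(name, dict(notice))
        used.append(name)
    return used
```

Sending the notice over all remaining connections, rather than a single one, means the remote point may learn of the outage even if one of the remaining connections is itself degraded.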
The aggregation technique also identifies when primary connections come back online. Primary and secondary connections that are avoided continue to receive pulses and can be reintegrated to the communication path through packet distribution.
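Reintegration of an avoided connection may be sketched as follows. This is a minimal sketch under assumptions: the class name PathPool is hypothetical, and the rule that a connection is reintegrated after a configurable number of consecutive answered pulses is introduced here for illustration, since the criterion for treating a connection as back online is not specified above.

```python
from typing import Dict, Iterable, Set


class PathPool:
    """Hypothetical tracker of connections in use and connections being avoided."""

    def __init__(self, names: Iterable[str], recovery_acks: int = 3) -> None:
        self.active: Set[str] = set(names)
        self.avoided: Dict[str, int] = {}   # name -> consecutive answered pulses
        self.recovery_acks = recovery_acks  # assumed recovery criterion

    def avoid(self, name: str) -> None:
        """Remove a connection from packet distribution while continuing to pulse it."""
        self.active.discard(name)
        self.avoided[name] = 0

    def on_pulse(self, name: str, acked: bool) -> None:
        """Record a pulse result for an avoided connection and reintegrate it if recovered."""
        if name not in self.avoided:
            return
        self.avoided[name] = self.avoided[name] + 1 if acked else 0
        if self.avoided[name] >= self.recovery_acks:
            del self.avoided[name]
            self.active.add(name)  # available again for packet distribution
```

Requiring several consecutive answered pulses, rather than one, is a design choice intended to keep an unstable connection from being reintegrated and avoided in rapid alternation.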