The present invention relates to a method and apparatus for distributing network management processing load across a plurality of network management processing elements of a cluster. The network may for example be a communications network. The present invention also relates to a computer program product configured, when run on a computer, to carry out a method for distributing network management processing load across a plurality of network management processing elements of a cluster.
In recent years, there has been a significant increase in the number of network elements being managed by mobile network management systems. In GSM systems for example, the number of network elements has tended to be of the order of hundreds of network elements. In LTE, networks with hundreds of thousands of network elements may not be uncommon. Mobile networks are also increasingly heterogeneous and may be running multiple different radio technologies including 2G, WCDMA, LTE and Wireless LAN. Complex radio architectures may be put in place to support mobile networks, architectures including macro, micro, pico, and femto cells.
The computers on which management systems for such networks run have also evolved, with many computing entities now being run in a highly distributed manner, often deployed on blade systems using a cloud approach. Such distributed infrastructures often have support for the substantially seamless addition or removal of computing elements from the cloud, and for the adjustment of both physical and virtual machines allocated to the cloud.
Management systems for communication networks have developed to take advantage of the above discussed distributed architectures. Such management systems allow the processing power of the distributed infrastructure to be harnessed in order to manage large heterogeneous mobile communication networks. Individual management applications are frequently designed to scale horizontally: many instances of the application are run in parallel in separate processes, with each instance of the application carrying the load of a small portion of the managed network. The capacity of the management application can thus be adjusted simply by increasing or decreasing the number of instances of the application that are running at any given time. The processing load for the entire communications network may be balanced across the different instances of a management application, for example by dividing the network elements evenly across all instances of the application, or by measuring the load being generated by each network element, and planning distribution of the load based on these measurements.
Current approaches to ongoing load balancing in distributed computing platforms focus on providing support in the distributed computing infrastructure that allows applications to elastically increase and decrease the number of instances they are running. In the case of communications network management, however, the semantics used between network management applications and the network elements they control render the practical implementation of such support highly challenging. In general, management applications use stateful session semantics to interact with network elements. An application establishes a management session with a network element, carries out some operations and then closes the session. Such management sessions may be used to collect statistical information, carry out configuration operations or receive event streams. A particular instance of a management application is generally required to handle a particular network element for the duration of a particular management session. Certain management sessions, including for example those for alarm or event collection, can be very long lived, complicating the redistribution of network elements among management application instances for the purposes of load balancing.
In a managed network, management applications may be notified of the addition or removal of network elements by amendments to the application's topology, by an incoming connection from a previously unknown network element, or by discovering the network element. Management applications may be informed of changes to the number of management instances of the application by the management application topology and by the underlying distributed computing platform. Each time a network element is added to or removed from the network, the configuration of a management application instance is amended and that instance is re-initialised. When the number of instances in the management application is adjusted, the configuration of a number of instances in the application is amended and each of those instances is re-initialised. As noted above, the session oriented semantics used between management applications and controlled network elements mean that such amendment and re-initialisation of instances can be problematic. In order to change the allocation of a network element from one management application instance to another, the management session between the network element and the “old” management instance must be shut down and a management session must be established between the managed element and the “new” management instance.
Current systems address the difficulties of load balancing in management application instances either through the manual allocation of network elements to management application instances, or through the use of centralised algorithms. Execution of redistribution between management application instances must thus be centrally controlled, requiring coordination of a single operation across all concerned network elements and application instances. Each amended application instance must be shut down, its new configuration set and the instance must be restarted. Such procedures are unwieldy and highly difficult to automate owing at least in part to the high degree of coordination required across all amended network elements. The complex load balancing algorithms required in the centralised procedure are difficult to implement, and in the event of failure of an application instance, the load carried by that instance cannot be automatically reallocated among other instances.
It is an aim of the present invention to provide a method and apparatus which obviate or reduce at least one or more of the disadvantages mentioned above.
According to a first aspect of the present invention, there is provided a method of distributing network management processing load across a plurality of network management processing elements. Each network management processing element is a member of a cluster, one member being a head of the cluster updating the cluster state, and members of the cluster following the cluster state. The method comprises the cluster head monitoring the network management processing load across the members of the cluster, and, upon detecting that the cluster load is unbalanced, the cluster head updating the cluster state to initiate automatic rebalancing of the network management processing load across at least a subset of the plurality of members of the cluster, once tasks being processed by the subset of the plurality of members have been completed.
In some examples, automatic rebalancing of the network management processing load across at least a subset of the plurality of members of the cluster may take place once all tasks being processed by the subset of the plurality of members have been completed.
Unbalanced cluster load may be detected as a consequence of domain activity including for example new or removed network elements, changes in functions activated on network elements, or changes in the states of network elements. Alternatively or in addition, unbalanced cluster load may be detected as a consequence of cluster activity, including for example change in cluster membership following addition or removal of members, or changes in the network management processing load being handled by individual cluster members.
In some examples, the subset of the cluster may comprise all members of the cluster. In other examples, the subset may comprise fewer than all members of the cluster.
In some examples, the step of rebalancing the network management processing load may comprise suspending operation of each member of the subset of the plurality of members upon completion of processing of current tasks, and automatically rebalancing the network management processing load upon suspension of all members of the subset. Suspending operation may comprise suspending initiation of new tasks while current tasks are completing.
In some examples, the step of automatically rebalancing the load may comprise adding network management processing elements to the subset from a pool of started members of the cluster, or removing network management processing elements from the subset to the pool of started members of the cluster.
In further examples, members may be started up and added to, or stopped and removed from, the pool according to specified pool maximum and minimum occupancy limits. Members of the pool may comprise network management processing elements that have been started but do not have any network management processing load allocated to them.
In some examples, the step of automatically rebalancing the load may comprise at least one member of the subset running a load balancing algorithm to set the network management processing load handled by the network management processing element according to processing load data shared between cluster members.
In some examples, the method may further comprise detecting that the cluster has changed state from a first state to a second state, and changing the state of members of the cluster from the first state to the second state once tasks being processed have been completed. In some examples, the first state may be a running state and the second state may be a suspended state.
In some examples, the step of changing the state of cluster members may comprise suspending operation of members of the cluster upon completion of processing of current tasks, and changing the state of members of the cluster from the first state to the second state.
In some examples, the method may further comprise checking the cluster state on occurrence of a trigger event, wherein a trigger event may comprise at least one of expiry of a time period or a network event. In some examples, the time period may be a repeating time period, such that checking the cluster state is carried out on a periodic basis, which basis may be superseded in certain examples by occurrence of a network event.
According to another aspect of the present invention, there is provided a computer program product configured, when run on a computer, to execute a method according to the first aspect of the present invention. Examples of the computer program product may be incorporated into an apparatus such as a network management processing element. The computer program product may be stored on a computer-readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal, or it could be in any other form. Some or all of the computer program product may be made available via download from the internet.
According to another aspect of the present invention, there is provided a method of distributing the processing for controlling a communication network. The network comprises a plurality of nodes, and is controlled by a cluster of a plurality of processing elements, each processing element being a member of the cluster. One member of the cluster is a head of the cluster updating the cluster state, and members of the cluster follow the cluster state. The method comprises conducting steps according to the first aspect of the present invention.
According to another aspect of the present invention, there is provided a network management processing element operating as a member of a cluster comprising a plurality of network management processing elements, one member of the cluster being a head of the cluster configured to update the cluster state. The network management processing element comprises a detector configured to detect updating of the cluster state, and a load balancer configured to rebalance the network management processing load handled by the network management processing element with reference to at least a subset of the plurality of members of the cluster once tasks being processed by the subset of the plurality of members have been completed.
In some examples, the processing element may further comprise an element state manager configured to suspend operation of the element.
In some examples, the detector may be configured to detect that the cluster has changed state from a first state to a second state, and the element state manager may be configured to change the state of the network management processing element from the first state to the second state once tasks being processed have been completed.
In some examples, the load balancer may be configured to run a load balancing algorithm to set the network management processing load handled by the network management processing element according to processing load data shared between cluster members.
In some examples, the network management processing element may further comprise a checking unit configured to check the cluster state on occurrence of a trigger event, wherein a trigger event comprises at least one of expiry of a time period or a network event. The detector may be configured to detect updating of the cluster state on the basis of information provided to the detector by the checking unit.
In some examples, the network management processing element may further comprise a monitor configured to monitor the network management processing load across the members of the cluster, and a cluster state manager configured to update the cluster state on detecting that the monitored network processing load is unbalanced.
In some examples, the network management processing element may further comprise an identity unit configured to determine if the network management processing element is to operate as the cluster head.
According to another aspect of the present invention, there is provided a network management processing element operating as cluster head of a cluster comprising a plurality of network management processing elements. The network management processing element comprises a monitor configured to monitor the network management processing load across the members of the cluster, and a cluster state manager configured to update the cluster state on detecting that the monitored network processing load is unbalanced.
According to a further aspect of the present invention, there is provided a network management processing element operating as a member of a cluster comprising a plurality of network management processing elements, one member of the cluster being a head of the cluster configured to update the cluster state. The network management processing element comprises a processor and a memory, the memory containing instructions executable by the processor whereby the network management processing element is operative to detect updating of the cluster state, and rebalance the network management processing load handled by the network management processing element with reference to at least a subset of the plurality of members of the cluster once tasks being processed by the subset of the plurality of members have been completed.
For a better understanding of the present invention, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the following drawings in which:
Aspects of the present invention provide a method for distributing network management processing load across a plurality of network management processing elements. Examples of the method allow for decentralised dynamic load balancing of network management processing load across multiple instances of a network management application.
With reference to
Apparatus for conducting the method described above, for example on receipt of suitable computer readable instructions, may be incorporated within network management processing elements.
With reference to
The detector 240 is configured to detect that a cluster state has been updated. The detector 240 may detect an update in cluster state on the basis of information provided by the checking unit 260, which may be configured to make a periodic check of the cluster state and inform the detector 240 of the cluster state, so allowing the detector 240 to detect an update of cluster state from a previous state to a new state. The load balancer 250 is configured to effect load balancing of network processing load between members of a subset of the cluster. The load balancer 250 may be configured to run a load balancing algorithm to set its network management processing load according to processing load data shared between cluster members. The network management processing load may be set on the basis of members of the subset, including one or more members that may be added to the subset from, or removed from the subset to, a pool of members that have been started but do not have any allocated load. The detector 240 initiates load balancing by the load balancer 250 on detecting an updated state of the cluster. This initiation may be conducted by way of the element state manager 270. The element state manager 270 may be configured, on receipt of information from the detector 240, to suspend operation of the element once current tasks have been completed, and then to update the state of the element to reflect the updated state of the cluster. Once all members of the subset have a state reflecting that of the cluster, the element state manager 270 may trigger the load balancer 250 to conduct load balancing.
As discussed above, one member of the cluster is a cluster head, and updates the state of the cluster. Any member of the cluster may operate as cluster head, and the element 200 may comprise an identity unit 280, which may be configured to determine if the element 200 is required to function as the cluster head. This may be determined on the basis of algorithms programmed in the identity unit 280. The element 200 may further comprise a monitor 292 and cluster state manager 294, grouped together in a cluster management unit 290. If the identity unit 280 determines that the element 200 is required to operate as the cluster head, the identity unit 280 may be configured to instruct the cluster management unit 290 to run monitoring and cluster state management. The monitor 292 may be configured to monitor network management processing load across the cluster members, and the cluster state manager 294 may be configured to update cluster status on the basis of the monitored cluster load. The functionality of the units of the processing element 200 is described in greater detail below with reference to
Referring to
Operation of the method of
As discussed above, aspects of the present invention involve load balancing of network management processing load between at least a subset of members of a cluster. Each network management processing element may be running a single instance of a network management application, and each processing element is a member of a cooperating cluster. Each cluster member operates autonomously, sharing state and data information with all other members of the cluster. Cluster state and data are then used to coordinate load balancing across members of the cluster. In the following discussion, load balancing among a subset comprising all cluster members is explained as an example, although balancing among a subset comprising only some cluster members may also be accomplished by the method, as discussed in later examples.
The state and data information shared between cluster members is illustrated in
A record 406 for each member of the cluster is held as shared information, visible to all cluster members. The Member ID 408 uniquely identifies a member in the cluster, and the Member State 410 represents the state of the identified member at a given instant. Member data 412 is any data specific to a member such as the current list of network elements being managed by that member. Information is shared among cluster members using any appropriate method of distributed information sharing. Appropriate methods for distributed information sharing may include distributed hash maps or Inter Process Communication (IPC) between cluster members.
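By way of illustration only, the shared information described above might be represented as in the following minimal sketch. The field and class names are hypothetical, and the actual representation (for example entries in a distributed hash map, or messages exchanged over IPC) is implementation specific.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class State(Enum):
    """Possible states shared by the cluster and its members (illustrative)."""
    STOPPED = "stopped"
    RUNNING = "running"
    SUSPENDED = "suspended"


@dataclass
class MemberRecord:
    """Per-member record visible to all cluster members."""
    member_id: str                                    # uniquely identifies the member
    member_state: State                               # state of the member at a given instant
    member_data: dict = field(default_factory=dict)   # e.g. list of network elements managed by the member


@dataclass
class ClusterState:
    """Cluster-wide record that members follow."""
    cluster_state: State                              # state set by the cluster head
    cluster_data: dict = field(default_factory=dict)  # e.g. full element list, event rate statistics
    members: Dict[str, MemberRecord] = field(default_factory=dict)
```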
Each cluster member may be started up or shut down according to cluster load at any given time. The sequences for member start up 500 and shut down 600 are illustrated in
Referring to
As mentioned above, the state of individual cluster members follows the state of the cluster to which they belong. This is achieved via the periodic process of
In some examples of operation, the cluster state will not transition from a state Sx to another state Sy until all cluster members have the initial state Sx. Therefore, if a cluster state transition occurs from state S1 to state S2 and on to state S3, the cluster state will generally not transition to state S3 until all desired cluster members have reached state S2. In abnormal cases such as resetting to clear errors, the cluster state may transition to an initial state S1, forcing all cluster members to also reset. In other examples, a cluster member may abort attempts to update its state to the cluster state in the event that the cluster state is changed again. Thus if the cluster state changes from state S1 to state S2 and on to state S3, and, at the time of the change from state S2 to state S3, a cluster member has still not transitioned from state S1 to state S2, the cluster member may abort attempts to transition to state S2 and initiate attempts to transition to state S3.
Each cluster member manages its state autonomously and asynchronously, monitoring the cluster state without any reference to other cluster members and without synchronizing with other members. As long as the cluster state does not change, the cluster member need not carry out any actions.
Cluster members follow the state of the cluster by periodically running a state management sequence as illustrated in
Referring to
If a cluster member determines that it is required to act as the cluster head (Yes at step 702), the cluster member proceeds at step 704 to conduct the cluster state check periodic process illustrated in
Returning to step 708, if transition to the new state has not already been ordered (No at step 708), the member then checks, at step 714, whether transition to a state other than the new cluster state has already been ordered. If this is the case (Yes at step 714), the cluster member cancels operations to transition to that state at step 716 and then proceeds at step 718 to initiate operations to transition to the new cluster state. The member then exits the process. If no other state transition has been ordered (No at step 714), the member may proceed directly to step 718, initiating operations to transition to the new cluster state. Subsequent iterations of the state management procedure will check if the initiated operations have been completed and if so, will set the new state of the cluster member.
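A minimal sketch of the state management sequence described above is given below, building on the shared data structures sketched earlier. The member helpers used here are hypothetical, and the comparison of member state against cluster state (referred to as step 706 later in this description) is an assumption of this sketch.

```python
def run_state_management(member, cluster):
    """Periodic state management sequence run autonomously by each cluster member (sketch)."""
    if member.is_cluster_head():                      # step 702: does this member act as cluster head?
        cluster_state_check(member, cluster)          # step 704: head may update the cluster state

    new_state = cluster.cluster_state
    if member.state == new_state:                     # step 706 (assumed): nothing to follow
        return

    if member.ordered_transition == new_state:        # step 708: transition already ordered
        if member.transition_complete():              # a later iteration sets the new member state
            member.state = new_state
        return

    if member.ordered_transition is not None:         # step 714: a different transition is in progress
        member.cancel_transition()                    # step 716: cancel operations towards that state
    member.order_transition(new_state)                # step 718: initiate operations towards the new state
```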
One example of a state transition is a transition from a running state to a suspended state in preparation for load balancing. Operations to execute this transition may include cancelling any tasks that are in progress on management sessions towards network elements and terminating sessions towards those network elements. The precise operations required to enable a cluster member to enter the suspended state may depend upon the management application running on the cluster member. In some cases, it may be necessary to carry out a complex transaction oriented shutdown of sessions with network elements to achieve an ordered shutdown. In other cases, a simple disconnection of a communication channel may suffice. In further examples, a compromise may be reached in which an attempt is made to shut down sessions to the network elements in an orderly manner, but if the sessions fail to terminate within a certain time, the connection may be timed out and the state transition effected regardless of the completion state of the remaining tasks. Tasks which are required to complete before entering a suspended state may include, for example, file transfer of statistics or of charging data from a network element to the management application. When a transition to a suspended state is ordered, a cluster member may defer any new file transfers but complete any ongoing transfers before suspending operations.
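The compromise approach just described, in which orderly shutdown is attempted but timed out if necessary, could look roughly as follows. This is a sketch only; the session methods (defer_new_tasks, finish_ongoing_transfers, cancel_tasks, close, force_disconnect) are hypothetical stand-ins for whatever the management application provides.

```python
import time

def suspend_operations(sessions, timeout_seconds=30.0):
    """Prepare a cluster member for the suspended state (illustrative sketch)."""
    deadline = time.monotonic() + timeout_seconds
    for session in sessions:
        session.defer_new_tasks()              # defer e.g. new file transfers
        session.finish_ongoing_transfers()     # complete transfers already in progress
        session.cancel_tasks()                 # cancel remaining in-progress tasks
        remaining = deadline - time.monotonic()
        if remaining > 0 and session.close(timeout=remaining):
            continue                           # orderly shutdown succeeded in time
        session.force_disconnect()             # timed out: drop the connection regardless
```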
Another example of a state transition is a transition from a suspended state to a running state. The operation to make this transition may include load balancing, running an autonomous load balancing algorithm for the cluster member that allows it to calculate and set its management load using the shared data. The operation concludes by establishing a management session to each network element and commencing management tasks towards the network element.
In further examples, the running of a load balancing algorithm may be prompted by determining that all members of the subset amongst which processing load is to be balanced have entered the suspended state. For example, cluster members may check the member state data in the shared data of the cluster. Once all required subset cluster members have entered the suspended state, load balancing may be conducted through each cluster member running its autonomous load balancing algorithm. Once a cluster member has terminated its load balancing algorithm, it may set a flag in the shared data indicating that load balancing is complete. Once all relevant flags have been set, the cluster head, and hence cluster members, may transition back to the running state. Example transitions between running and suspended states are discussed in further detail with reference to
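The waiting-and-flagging behaviour described above might be sketched as follows, reusing the shared data structures sketched earlier. The "balancing_done" flag kept in member data and the call to my_network_elements (the position-based algorithm sketched after the next paragraph) are assumptions of this illustration.

```python
def balance_when_subset_suspended(member, cluster, subset_ids):
    """Run this member's own load balancing once every subset member is suspended (sketch)."""
    records = cluster.members
    if not all(records[m].member_state == State.SUSPENDED for m in subset_ids):
        return False                           # not yet: try again on the next periodic run
    managed = my_network_elements(member.member_id, subset_ids, cluster.cluster_data)
    records[member.member_id].member_data["managed_elements"] = managed
    records[member.member_id].member_data["balancing_done"] = True   # hypothetical completion flag
    return True
```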
A simple example of a load balancing algorithm is for a cluster member to determine its position in the set of running cluster members and to manage a set of network elements based on that position. Thus, if there were 10,000 network elements to manage and 10 cluster members, the sixth cluster member would manage network elements 6000 to 6999. Other algorithms can be envisaged and the management application may stipulate the algorithm to be used. Cluster members execute their load balancing algorithms autonomously and, according to some examples, completely independently. In other examples, peer communication between cluster members may be used as part of load rebalancing.
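The position-based algorithm just mentioned could be sketched as below. Zero-based positions and the "network_elements" key in the shared cluster data are assumptions of this sketch.

```python
def my_network_elements(member_id, subset_ids, cluster_data):
    """Position-based load balancing (sketch): each member manages a contiguous block of elements."""
    ordered = sorted(subset_ids)
    position = ordered.index(member_id)                  # zero-based position in the balancing subset
    elements = cluster_data["network_elements"]          # shared list, e.g. maintained by the cluster head
    block = len(elements) // len(ordered)                # 10,000 elements / 10 members = 1,000 per member
    start = position * block
    # The last member also picks up any remainder left by the integer division.
    end = len(elements) if position == len(ordered) - 1 else start + block
    return elements[start:end]
```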
If a state change is triggered from the domain (Yes at step 704b), the cluster state is updated by the cluster head at step 704c. Cluster data may also be updated by the cluster head at step 704c. Typical domain cluster data may include a new list of network elements to be managed and event rate statistics for those network elements. Cluster members may use this data when executing their autonomous load balancing algorithms.
The cluster head then proceeds, at step 704d, to fire rules to determine if a state change is warranted from a cluster point of view. If a state change was not warranted from a domain point of view (No at step 704b) the cluster head may proceed directly to step 704d. The cluster rules allow the cluster head to determine, in step 704e, if a cluster state change is required on the basis of the cluster. Cluster rules may be set to determine whether cluster membership has changed due to addition or removal of members and whether the current management load is sufficiently evenly balanced. The cluster rules may also be stored as part of the shared cluster data.
If a cluster state change is triggered from the cluster (Yes at step 704e), the cluster head then proceeds to determine, at step 704f, whether the current cluster size is optimal. If it is appropriate to add or remove members from the cluster (No at step 704f), the cluster head proceeds at step 704g to order cluster membership modification by the underlying distributed computing mechanism. This may include starting up cluster members as a result of heavy overall load or shutting down cluster members in periods of light overall load. If cluster membership modification is not required (Yes at step 704f), or once it has been ordered at step 704g, the cluster head proceeds to update the cluster state at step 704h. Cluster data may also be updated at step 704h with parameters for cluster members to use when executing their autonomous load balancing algorithm.
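A rough sketch of the cluster state check run by the cluster head is given below. The rule helpers on the head object are hypothetical, and setting the cluster state to suspended in order to trigger rebalancing follows the timeline example later in this description and is an assumption of this sketch.

```python
def cluster_state_check(head, cluster):
    """Cluster state check run periodically by the cluster head (sketch of steps 704a to 704h)."""
    domain_change = head.fire_domain_rules(cluster)       # steps 704a/704b: e.g. elements added or removed
    if domain_change:
        cluster.cluster_data.update(domain_change)        # step 704c: e.g. new element list, event rates
        cluster.cluster_state = State.SUSPENDED           # step 704c: order members to prepare to rebalance

    if not head.fire_cluster_rules(cluster):              # steps 704d/704e: membership change or uneven load?
        return                                            # no cluster-driven state change required

    if not head.cluster_size_optimal(cluster):            # step 704f
        head.order_membership_modification(cluster)       # step 704g: start or stop members via the platform

    cluster.cluster_data.update(head.balancing_parameters(cluster))
    cluster.cluster_state = State.SUSPENDED               # step 704h: update the cluster state
```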
Once the cluster state has been updated in step 704h, the cluster head then returns to step 706 of the state management method of
The process of
The above discussion uses the example of load balancing between a cluster subset that includes all members of the cluster. However, as previously mentioned, load balancing among a cluster subset including only some cluster members may also be conducted. For example it may be determined during the cluster state check process of
The subset amongst which load balancing is to be performed may also be adjusted by the addition of members from a pool or removal of members to a pool. Such a pool of cluster members may be used to alleviate rapid increases in load. A set of started cluster members may be held in the pool, with no load allocated to them. When a rapid increase in load occurs, some of the management load may be transferred to cluster members in the pool. When load decreases, cluster members may be returned to the pool. The minimum and maximum size of the pool, and the block size for increasing and decreasing the pool may be configurable. When the pool size decreases below its minimum size, new cluster members may be started and added to the pool. When the pool size increases above its maximum, cluster members may be stopped and removed from the pool.
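The configurable pool described above could be managed roughly as in the following sketch. The platform object, standing for the underlying distributed computing mechanism, and its start_members/stop_members calls are hypothetical.

```python
def adjust_pool(pool, platform, min_size, max_size, block_size):
    """Keep the pool of started but unloaded members within its configured bounds (sketch)."""
    while len(pool) < min_size:
        pool.extend(platform.start_members(block_size))        # start a block of new members into the pool
    while len(pool) > max_size:
        stopping = [pool.pop() for _ in range(min(block_size, len(pool)))]
        platform.stop_members(stopping)                        # stop a block of members removed from the pool
```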
At time t0, the cluster and all cluster members have state stopped. The application starts at time t1 with the cluster members starting up using the procedure of
One of the cluster members determines that it is to operate as cluster head and sets the cluster state to running at time t1. The periodic process of cluster members M1 to M6, detailed in
From time t2 to time t3, the application executes as normal; no load balancing is performed. At time t3, the cluster head detects, through the process of
The cluster head again sets the cluster state to running at time t5. The periodic processes of cluster members M1 to M6 once again read the cluster state change and trigger execution of their operations to transition from state suspended to state running. Load balancing is again executed as part of the transition to state running and sessions towards network elements are established again. By time t6, all cluster members have state running.
From time t6 to time t7, the application executes as normal; no load balancing is performed. At time t7, the cluster head detects, through the process of
According to a variation of the above example, load balancing may be carried out once cluster members M1 to M6 establish from the shared data that each other member has set its state to suspended. Each member may then set a flag once it has finished load balancing. Once all flags are set, the cluster head sets the cluster state to running and this is followed by the cluster members.
The above described method may be used for load balancing of processing load for a wide range of management applications. One example includes the collection and processing of events from network elements. In such a management application, network elements may be added to or removed from the network at any time. Many instances of a distributed management system are used to collect and process events from the individual network elements, and the number of instances running at any given time may vary based on the current network size, the current event load, or as a result of failure of particular instances. The load balancing method described herein may be used in such a system to balance the management load fairly across all running instances.
Aspects of the present invention thus provide a method enabling management applications to automatically balance their processing load across many distributed instances in an autonomous manner. The use of cooperating clusters, in which one member acts as a cluster head and cluster members follow the state of the cluster head, enables a management application to automatically handle changes to the domain, such as addition or removal of network elements, and to automatically adjust its load across currently running instances. Addition or removal of management instances may also be handled automatically, whether to accommodate changes in processing load, to optimise power consumption, or following failure of an application instance.
Examples of the present invention offer both robustness and flexibility. Once two or more instances of a management application are running according to aspects of the present invention, no single point of failure exists, as failure of any instance will be compensated for by balancing among the remaining instances. Failure of the cluster head instance will simply result in a new member determining that it is to operate as cluster head next time the periodic process of
Examples of the present invention are highly configurable: the domain and cluster rules for triggering load balancing may be set and amended at each run of the state management process. The load balancing algorithm used by cluster members can range from simple to highly complex, and can also be amended at run time. In addition, the method is applicable to a wide range of applications designed to run across distributed instances.
It should be noted that the above-mentioned examples illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim, “a” or “an” does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.
Filing Document | Filing Date | Country | Kind |
---|---|---|---
PCT/EP2013/072195 | 10/23/2013 | WO | 00 |