1. Field of the Invention
The present invention relates to the field of element management layer servers in telecommunication networks and in particular to a method for automatically overcoming possible failures in such servers. Furthermore, the present invention relates to a computer product adapted to perform the method steps.
2. Description of the Prior Art
As is known in the art of telecommunications, network elements are at least partially managed by servers through proper software tools. These management software tools are organized in a Telecommunication Management Network (TMN) hierarchy, which consists of a set of standardized protocols that create a layered architecture used for monitoring and managing telecommunications equipment, thus enabling highly complex networks to be managed as a single cohesive unit. The lower management layer of the TMN hierarchy is the Element Management Layer, briefly termed “EML”. The EML deals, for instance, with managing alarms, configuring the network apparatus, performing back-up and restore mechanisms (both for data and for software) and collecting performance monitoring information (detection of power consumption, temperature, available resources and others).
An EML server could incur problems for different reasons. For instance, when the configuration data and/or the configuration sequence-order of the network element are not consistent with those designed, the EML server could fail. An EML server could also fail because of a software bug.
Presently, when a problem arises, the server fails completely. Generally, the telecom service provider is not able to overcome the problem and contacts the infrastructure designer/provider. Whilst the problem can sometimes be overcome rather easily by the infrastructure provider, the time from problem notification to problem solution is of the order of some hours or even days. This is because the service provider has to detect the problem and notify it to the telecom infrastructure provider; in turn, the infrastructure provider has to find the proper solution, possibly by testing an in-house server, and then instruct the service provider accordingly. Finally, the service provider has to take the suggested action.
The Applicant has observed that the time elapsed from problem detection to problem solution is too long and could be profitably reduced, thus reducing the operating and maintenance cost of the whole telecommunication network (operating expenditure or “OPEX”). Thus, the Applicant has faced the general problem of reducing the OPEX of a telecommunication network. More in detail, the problem is how to reduce the maintenance time and the downtime of an EML server in a telecommunication network. From the service provider's point of view, quick feedback about an unexpected error is strategic for its business.
These and further problems are solved by the method according to claim 1 and by the computer product according to claim 15. Further advantageous features of the present invention are set forth in the respective dependent claims. All the claims are deemed to be an integral part of the present description.
According to a first aspect of the present invention, a new method for automatically overcoming a failure and/or an error in an EML server is provided. According to a second aspect of the present invention, a new computer product is provided.
According to the new method, the EML server is composed of several active Units having substantially the same basic structure. Furthermore, a common error management for all the Units is provided through an error collector and an error supervisor. The Units periodically send error and status information to the error collector. The error collector, by processing this information coming from the Units, is able to determine whether a Unit is affected by an error. The processed error and status information is then sent to the error supervisor, which further processes this information and decides, through a suitable failure model, the workaround actions to be performed on the Unit affected by the error. The workaround actions are finally executed, without any manual intervention of an external operator.
With the method according to the present invention, managing the error detection and the workaround procedure in an EML server is simplified. More particularly, such a method allows the self-detection of errors and the automatic activation of workarounds. If the automatic workaround is successful, no time is spent by either the service provider or the network provider to fix the problem. Besides, in case the automatic workaround does not fix the error, the network provider will be able to find a solution for such an error more quickly, as a number of hypotheses about the error cause can be discarded a priori. Hence, in both cases the EML server according to the present invention allows a reduction of the OPEX of the telecommunication network.
According to a first aspect of the present invention, a method for automatically overcoming a failure or an error in an EML server is provided. The method comprises the steps of: identifying one or more Units in said EML server; providing an error collector; providing an error supervisor; defining a failure model; notifying the error collector of the status of Units; processing Unit status information through said failure model, in said error supervisor; and instructing, through said error supervisor, the Units about workaround-actions to be taken.
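By way of illustration only, the sequence of steps recited above can be sketched in Python as follows. This is a minimal sketch and not the claimed implementation: the class names, the field names, the dictionary used as failure model and the `default_action` fallback are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Indication:
    """A status or error message sent by a Unit (field names are assumptions)."""
    unit: str
    kind: str    # "status" or "error"
    detail: str


@dataclass
class ErrorCollector:
    """Receives the indications notified by the Units."""
    received: list = field(default_factory=list)

    def notify(self, indication):
        self.received.append(indication)

    def errors(self):
        # Keep only the indications that report an error.
        return [i for i in self.received if i.kind == "error"]


@dataclass
class ErrorSupervisor:
    """Selects a workaround-action per affected Unit through a failure model."""
    failure_model: dict              # hypothetical: error detail -> action name
    default_action: str = "restart"  # hypothetical fallback, not from the text
    log: list = field(default_factory=list)

    def instruct(self, collector):
        decisions = {}
        for ind in collector.errors():
            action = self.failure_model.get(ind.detail, self.default_action)
            decisions[ind.unit] = action
            # Store the taken workaround-action in a log.
            self.log.append((ind.unit, action))
        return decisions


collector = ErrorCollector()
collector.notify(Indication("unit-a", "status", "running"))
collector.notify(Indication("unit-b", "error", "queue-overflow"))

supervisor = ErrorSupervisor(failure_model={"queue-overflow": "reset"})
decisions = supervisor.instruct(collector)
print(decisions)  # {'unit-b': 'reset'}
```

In this sketch only the Unit that sent an error indication receives an instruction, and the supervisor keeps a record of what it decided.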
The step of notifying the error collector of the status of Units is preferably carried out by said Units which send to said error collector status and/or error indication messages.
Preferably, the method further comprises a step of identifying in said EML server one or more Core-Units, said Core-Units being able to send to said error collector different core metrics.
Preferably, the step of processing Unit status information through said failure model comprises selecting workaround-actions from a set of predefined workaround-actions.
The Failure Model can be either static, dynamic or probabilistic.
The error supervisor can profitably store the taken workaround-actions in a proper log or memory.
Profitably, each sub-component individually communicates with said error collector and performs the workaround-actions according to the instructions from said error supervisor.
The method further comprises the steps of identifying, for each of said Units, a type of Unit; and defining, for each type of Unit, a set of predefined workaround-actions. The set of predefined workaround-actions may comprise workaround-actions which are aimed at moving the Units affected by a failure or an error to a stable condition.
According to a possible implementation of the present invention, said workaround-actions to be taken comprise one or more of the following actions: restart, reset and restore.
Profitably, the error collector stores error reports coming from components which are external to the EML server. Preferably, the error collector stores the most meaningful indications in a log or memory.
The step of identifying, for each of said Units, a type of Unit may comprise the step of classifying said Units as permanent stated component, dynamic stated component and stateless component.
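Purely as an example, the classification of Units and the per-type sets of predefined workaround-actions could be represented as below. The three Unit types and the three action names are taken from the text; which actions belong to which type is an assumption made only for this sketch.

```python
from enum import Enum


class UnitType(Enum):
    PERMANENT_STATED = "permanent stated"
    DYNAMIC_STATED = "dynamic stated"
    STATELESS = "stateless"


class Action(Enum):
    RESTART = "restart"
    RESET = "reset"
    RESTORE = "restore"


# ASSUMPTION: this assignment of actions to Unit types is illustrative only
# and is not taken from the description.
SUPPORTED_ACTIONS = {
    UnitType.STATELESS: {Action.RESTART},
    UnitType.DYNAMIC_STATED: {Action.RESTART, Action.RESET},
    UnitType.PERMANENT_STATED: {Action.RESTART, Action.RESET, Action.RESTORE},
}


def is_supported(unit_type, action):
    """Return True if the action is in the predefined set for this Unit type."""
    return action in SUPPORTED_ACTIONS.get(unit_type, set())
```

A supervisor could use a check such as `is_supported` to avoid instructing a Unit to perform an action outside its predefined set.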
According to a different aspect, the present invention provides a computer product comprising computer program code means adapted to perform all the steps of the above method when said program is run on a computer. The computer product comprises a computer program or a computer readable storage medium.
According to a further aspect, the present invention provides a network element comprising a computer product as set forth above.
The present invention will become clear in view of the following detailed description, given by way of example and not of limitation, to be read in connection with the attached figures.
In the drawings:
FIGS. 3a and 3b schematically show the structure of examples of Units as dynamic stated component and permanent stated component, respectively; and
FIGS. 4a and 4b show examples of external error reporting, from an agent and from a user, respectively.
The EML is also connected to its client layer in the TMN hierarchy, i.e. to the Network Management Layer (NML), which includes client supporting protocols, such as:
From the functional point of view, the EML comprises two separated entities, as shown in
As mentioned before, according to the present invention the EML server is divided into components called “Units”. For instance, according to a preferred embodiment of the present invention, such Units can comprise the following types of Units:
Typically, an EML client comprises all the GUIs, in order to interface directly with the users (e.g. operators and software developers).
The main advantage of such a subdivision into Units is that the complexity of each Unit is lower than that of the entire EML server architecture. Therefore, in order to validate a new software version of a Unit, it is possible simply to perform a combinatorial sequence of tests, which is often an exhaustive sequence. A simple automatic test system could support the testing/validation of the Unit at the development phase.
On the other hand, such a subdivision into Units makes it difficult to verify the interaction of the Units (both spatial and temporal interactions) over the entire system. A Unit could register itself on neighbour Units (spatial interactions) and receive messages from them at any time (temporal interactions). The “combinatorial test-case generation technique” could provide an ineffective set of tests, as it could not cover all possible interactions at the integration phase, thus making it difficult to provide an exhaustive description of the EML server from the error point of view, as will be explained in detail hereinafter.
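The contrast drawn in the two preceding paragraphs can be illustrated with a short sketch of combinatorial test-case generation. The input domains below are hypothetical and were chosen only to make the arithmetic visible; the figure for five Units assumes that the Units are tested jointly and still ignores message ordering (the temporal interactions).

```python
from itertools import product


def combinatorial_cases(parameters):
    """Return every combination of the given parameter values as a dict."""
    names = list(parameters)
    domains = [parameters[name] for name in names]
    return [dict(zip(names, values)) for values in product(*domains)]


# Hypothetical input domain of a single Unit.
unit_inputs = {
    "message": ["status", "error"],
    "queue_state": ["empty", "partial", "full"],
    "load": ["low", "high"],
}

cases = combinatorial_cases(unit_inputs)
per_unit = len(cases)                 # 2 * 3 * 2 = 12 cases for one Unit
joint_for_five_units = per_unit ** 5  # 248832 joint cases for five such Units

print(per_unit, joint_for_five_units)  # 12 248832
```

An exhaustive sequence is therefore practical for one Unit, whereas the number of joint cases grows multiplicatively with the number of interacting Units.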
According to the present invention, each Unit is responsible for informing an error collector EC (shown in
With further reference to
Similarly, the Core-Units send to the error collector EC different core metrics CM. Such core metrics could include, for example:
The error collector EC collects the Unit Status Indications (USI) and the Unit Error Indications (UEI) coming from the different Units, as well as the core metrics CM coming from the Core-Units. In addition, the error collector EC stores the error reports coming from components which are external to the EML server, i.e. agents and users, as will be described hereinafter with reference to
The EML server architecture according to the invention interacts with an error supervisor ES. Referring to
The main task of the error supervisor ES is to correlate the error and status indications from the error collector EC with other information (e.g. the core metrics (load indication or “LI”) provided by the Core-Units) and consequently decide, through a proper failure model, the workaround-actions (WA in
In order to simplify the processing performed by the error supervisor ES, all the Units implement the same mechanisms for detecting errors, as well as the same mechanisms for transmitting Unit Error Indications UEI to the error collector EC. Possibly, the Units can transmit Unit Error Indications UEI to registered Units.
Furthermore, in order to simplify the management of the workaround, a common set of workaround-actions is defined. The definition of the possible actions is based on the type of Unit. Three types of Units can be found:
FIGS. 3a and 3b schematically show the structure of examples of Units as dynamic stated component and permanent stated component, respectively.
Referring to
For example,
FIG. 3b shows an example of a Unit as a permanent stated component comprising a non-volatile memory sub-component M, two input queues Qin, Qis, two output queues Qon, Qos, two control sub-components Cns, Csn and a Calculus/View sub-component C/V. When the Unit represented in
According to the invention, each above-mentioned type of Unit is characterized by a set of supported workaround-actions. According to a preferred embodiment of the present invention, these actions are aimed at moving the Unit affected by an error to a stable condition, i.e. a condition in which the effects of the error are minimized. As the most stable condition is deemed to be the initial Unit inter-working, the actions decided by the error supervisor ES are aimed at moving the Unit towards its initial condition (i.e. startup or default state). Hence, for stateless Units, the supported actions are:
For dynamic stated Units, the supported actions are:
For permanent stated Units, the supported actions are:
It is remarked that the above actions are only a set of possibilities among a larger number of possible actions.
The error supervisor ES decides the action to perform on the Units affected by errors based on a “failure model” (
The failure model comprises a description of the EML server from the error point of view. In particular, the failure model comprises a description of the interaction between Units and of the functional dependence between Units from the error point of view. In other words, the failure model associates an error status to a given set of Error Indications (coming from the Units), Status Indications (coming from the Units) and Load Indications (coming from the Core Units).
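As a non-limiting illustration, the association between a set of indications and an error status can be sketched as a lookup table. The table entries, the status names and the reduction of the Load Indication to a boolean flag are assumptions made for the example and are not taken from the description.

```python
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    error_indication: Optional[str]  # UEI from a Unit, None if absent
    status_indication: str           # USI from a Unit
    load_high: bool                  # LI from the Core-Units, reduced to a flag


# Hypothetical table: each observed combination maps to an error status.
FAILURE_MODEL = {
    Observation("timeout", "running", True): "overload",
    Observation("timeout", "running", False): "unit-fault",
    Observation(None, "running", False): "ok",
}


def error_status(obs):
    """Return the error status associated with an observation."""
    return FAILURE_MODEL.get(obs, "unknown")


print(error_status(Observation("timeout", "running", True)))   # overload
print(error_status(Observation("timeout", "running", False)))  # unit-fault
```

In this sketch the same Unit Error Indication leads to two different error statuses depending on the Load Indication, which is the kind of correlation the error supervisor ES is described as performing.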
According to the type of description of the EML server, the failure model can be:
As mentioned above, the EML server according to the present invention also supports external error reporting.
Besides, an error indication can also be generated by a user who finds an error, as shown in
It should be noted that, in transmission networks requiring a dynamic failure model, the results of both the error reporting from agents and the error reporting from users can be employed to dynamically update the failure model.
The EML server according to the present invention exhibits many advantages. First of all, the overall time for fixing an error is reduced with respect to the known solutions, as the workaround procedure is automatically activated and managed by the EML server, and no manual intervention of the network provider is required. Hence, all the feedback exchanges between service provider and network provider required for a known workaround procedure according to the prior art, which in general takes days or even weeks, are avoided. In many cases, if the automatic workaround is successful, the network continues working without losing time waiting for the error to be fixed (“downtime”). In any case, even if the automatic workaround is not successful, the network provider is able to search for the solution without affecting the downtime of the network, with an overall reduction of the OPEX of the network.
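One conceivable way of using external reports to update a failure model is sketched below. The update policy (adopting an association once it has been reported a given number of times), the `threshold` parameter and all names are assumptions introduced for the example; the description does not prescribe any particular policy.

```python
class DynamicFailureModel:
    """Failure model whose rules can be extended by external error reports."""

    def __init__(self, rules=None, threshold=2):
        self.rules = dict(rules or {})  # symptom -> error status
        self.threshold = threshold      # hypothetical number of reports needed
        self.report_counts = {}

    def lookup(self, symptom):
        return self.rules.get(symptom, "unknown")

    def report(self, symptom, status, source):
        """Record an external report coming from an agent or from a user."""
        if source not in ("agent", "user"):
            raise ValueError(f"unexpected report source: {source!r}")
        key = (symptom, status)
        self.report_counts[key] = self.report_counts.get(key, 0) + 1
        # ASSUMPTION: adopt the association once it has been reported
        # at least `threshold` times, whatever the source.
        if self.report_counts[key] >= self.threshold:
            self.rules[symptom] = status


model = DynamicFailureModel()
model.report("stale-view", "cache-fault", "agent")
print(model.lookup("stale-view"))  # unknown
model.report("stale-view", "cache-fault", "user")
print(model.lookup("stale-view"))  # cache-fault
```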
Number | Date | Country | Kind |
---|---|---|---|
04 292 703.8 | Nov 2004 | EP | regional |