1. Field of the Invention
The present invention generally relates to an apparatus and a corresponding method for event diagnosis. More specifically, the present invention relates to event diagnosis in a computerized system using classification of the different events in the computerized system leading to error correction and solving.
2. Discussion of the Related Art
Computerized systems no longer involve a single closed system, and the use of multi-tier software architectures, in which the database or the application servers are separate from the end user, has many advantages. One benefit is that maintenance of servers and databases can be performed by a skilled person in a remote location, while the clients and users can still use the computerized system far away from that remote location. Another benefit is data security. The data can always be backed up in a safe remote location while the clients and users can be located in areas where backup facilities are not available or are less reliable. Another benefit is the simplicity of using the same computerized online system for large organizations having a few remote branches. As a result, even a simple application consists of several systems (nodes) that interact via well-defined protocols. In a non limiting example, a simple user request for a web page describing product specifications in an e-commerce system may be translated by the browsing computer program into an HTTP request over TCP over IP, which, provided it passes the firewall and the anti-virus proxy, is load balanced by a load balancer and intercepted by a web server. The web server then delegates the request to a web container, which translates this request to IIOP/RMI/SOAP procedure calls at the application server, which in turn translates them to JDBC, JMS or SOAP in order to access the database, MOM (Message Oriented Middleware) or external applications via EAI (Enterprise Application Integration) interfaces and the like. A failure at a single node or tier can affect another remote node or tier or even the whole application, such that the root cause of the malfunction is indirect and difficult to discover.
A typical application may generate numerous log files that need to be examined before the cause of a failure is revealed, but due to the vast amount of information gathered, cross-referencing all the utilized resources on the one hand with all the application events on the other is a substantially challenging task. Thus, identifying the root cause of a problem is extremely difficult and requires substantial resources.
Computerized system failures can be divided into three groups. The first group is a permanent failure in which the computerized system error remains until the root cause for that error is fixed. The second group is a specific circumstance failure in which the computerized system error reoccurs only under specific circumstances. The third group is a single occurrence failure in which the computerized system error occurred once or twice. Currently available monitoring tools provide minor assistance for the first and second groups and, in the case of a single event that was not logged, no assistance for the third group. Furthermore, a single node monitoring tool lacks the ability to perform a multi-tier analysis and by definition ignores other environmental factors. Current multi-tier monitoring tools are designed to address a specific system architecture; a monitoring tool for a first company's Enterprise Resource Planning (ERP) using a second company's database installed on a third company's server platform will not be useful for other ERP applications. One example of the lack of capabilities of currently available assisting tools is that these tools focus on optimization or monitoring of only a single component of the computerized system: a tool monitoring the databases might recommend that a given SQL (Structured Query Language) statement be rewritten to reduce the imposed I/O load while the actual problem may be an I/O contention bottleneck or fragmentation.
There is therefore a need for a multi-tier monitoring tool which is platform independent and software component independent and which takes into consideration substantially all the resources from the different tiers of the computerized system. The multi-tier monitoring tool will preferably eliminate the need for looking at the different log files of the different tiers of the computerized system. The monitoring tool will preferably assist in analyzing the root cause of a failure, enabling the user to manipulate the configuration of the computerized system in order to prevent the same root cause from recurring. The monitoring tool will preferably alert the user of a possible failure before it occurs. The monitoring tool will preferably be a generic and adaptive tool, such that shared data acquired in one environment will be useful in a different environment.
The present invention overcomes the disadvantages of the present art by providing a new and novel method and apparatus for event diagnosis in computerized systems.
In some exemplary embodiments of the present invention there is provided an apparatus and a method for event diagnosis that does not require searching of errors and anomalies at the different log files of the different parts of the computerized system. One benefit of the present exemplary embodiment relates to error correction and error solving in large multi-tier computerized systems and environments.
In some exemplary embodiments of the present invention the apparatus and method use classification of the various events in the computerized system according to various measurable attributes of the resource. In this way, specific overloads and bottlenecks in resources can be easily identified by a person skilled in the art, and the root cause of a possible malfunction of the computerized system can then be resolved. Another benefit of the present invention is that the computerized system personnel may classify and diagnose substantially all the failures that may occur during the operation of the computerized system before their occurrence, simply by classifying the various events in the computerized system according to the various measurable attributes of the resource. Such a system can, in some exemplary embodiments, be a network system or a combination of computerized and network systems. The classification of the events is measurable by the various attributes of each consumed resource.
The measurable attributes can comprise, in some exemplary embodiments of the present invention: time; consumed time; speed; network speed; storage space; available space; space; free space; bit rate or byte rate; read or write queue length; average queue length; temporary queue length; read or write time; transfer time; idle time; split i/o; packets; packets received; packets sent; packets per sec; bandwidth; received bytes; page faults; available bytes; committed bytes; commit limit; write copies; transition faults; cache faults; demand zero faults; pages input; page reads; pages output; pool paged; pool non-paged; page writes; free system page table entries; cache; cache peak; pool paged resident; system code total; system resident code; system total resident; system total driver; packets received; packets sent; packets error; packets unknown; system driver; system resident driver; system resident cache; committed in use; processor time; user time; interrupt; threads; processes; system up time; alignment fixups; exception dispatches; floating emulations; registry quota in use; file read operations; file write operations; file control operations; file read bytes; file write bytes; file control bytes; context switches; system calls; file data operations; system up time; processor queue length; memory page faults; page file sys usage; page file sys peak; and the like.
One or more of the said attributes can be measured per second; bytes per second; seconds; bytes; bytes length; queue length; packets and the like.
In some exemplary embodiments of the present invention the apparatus and method generate an event profile taking into consideration substantially all resources of the different tiers of the computerized system; such a system can, in some exemplary embodiments, comprise substantially all of the now known or later developed topologies and applications.
In other exemplary embodiments of the present invention there is provided an apparatus and a method for detecting events prior to resource malfunction, a group of over-consuming events, a single resource bottleneck which occurs when events are consuming the same resource, an event locking situation, and the like. Such a model of event to resource relation is essential for automatic problem and root-cause detection.
Thus, in accordance with the present invention there is provided a method for diagnosis of a computerized system, the method being implemented within a computing platform, the platform comprising one or more processing units, one or more storage devices, and one or more communication devices, the method comprising the steps of collecting events or extracting data elements generated by an element of the computerized system; transforming the events or data elements to one or more event based time series, said one or more event based time series having one or more intervals; determining which resources of the computerized system are consumed by which events, for a first predetermined time interval; and determining a function between the one or more event based time series and measurable attributes of the resources for the events for a second predetermined time interval. The method further comprises a step of storing events or data elements generated by an element of the computerized system in a database. The first predetermined time interval is longer than or equal to the second predetermined time interval. The second predetermined time interval is contained in the first predetermined time interval. The method further comprises a step of determining a function between the one or more event based time series and the measurable attributes of the resources for the events for a third predetermined time interval. The third predetermined time interval is different from the second predetermined time interval. The step of determining the function between the one or more event based time series and measurable attributes of the resources for the events comprises the use of a least squares method step, or iteratively introducing weights into the said step. The event can be an event type, and the resource can be a consumed resource.
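The step of determining a function between the event based time series and a measurable resource attribute can be illustrated by the following minimal sketch. It assumes a linear model in which each event type contributes a fixed cost per unit of activity in an interval. The function names (`solve_linear`, `fit_consumption`, `fit_iteratively_reweighted`) and the particular weighting rule (weights inversely proportional to the absolute residual) are illustrative assumptions and are not taken from the description, which does not specify how the weights are chosen.

```python
from typing import List, Optional, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: event series are not independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_consumption(
    series: Sequence[Sequence[float]],
    resource: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted least squares via the normal equations.

    series[t][j] is the activity of event type j in interval t (assumed
    representation); resource[t] is the measured attribute in interval t.
    Returns one coefficient per event type.
    """
    n_types = len(series[0])
    w = list(weights) if weights is not None else [1.0] * len(series)
    xtwx = [
        [sum(wt * row[i] * row[j] for wt, row in zip(w, series)) for j in range(n_types)]
        for i in range(n_types)
    ]
    xtwy = [
        sum(wt * row[i] * y for wt, row, y in zip(w, series, resource))
        for i in range(n_types)
    ]
    return solve_linear(xtwx, xtwy)


def fit_iteratively_reweighted(
    series: Sequence[Sequence[float]],
    resource: Sequence[float],
    rounds: int = 5,
    eps: float = 1e-6,
) -> List[float]:
    """Refit with weights 1/|residual| each round (an assumed weighting rule)."""
    coeffs = fit_consumption(series, resource)
    for _ in range(rounds):
        residuals = [
            abs(y - sum(x * c for x, c in zip(row, coeffs)))
            for row, y in zip(series, resource)
        ]
        weights = [1.0 / max(r, eps) for r in residuals]
        coeffs = fit_consumption(series, resource, weights)
    return coeffs
```

Under these assumptions, each returned coefficient can be read as an estimate of how much of the measured attribute one unit of activity of the corresponding event type consumes. The reweighted variant reduces the influence of intervals that fit the model poorly.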
In accordance with the present invention, there is also provided an apparatus for diagnosis of a computerized system, the apparatus being implemented within a computing platform, the platform comprising a processing unit, a storage device, and a communication device, the apparatus comprising a collecting module for collecting information about the computerized system; a database for storing the information collected by the said collecting module; and an analyzing module for performing event diagnosis on the information collected by said collecting module and stored by said database. The apparatus further comprises a transforming module for transforming the information stored on said database to a predetermined form to be analyzed by said analyzing module for further processing. The apparatus further comprises a data visualization module for receiving and presenting the results of the event diagnosis performed by said analyzing module. The apparatus further comprises a display module for viewing the results of the event diagnosis received from the said data visualization module.
The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings. In the drawings like numerals refer to the same elements.
Monitoring device 100 continuously monitors, in some exemplary embodiments of the present invention, the consumption time of the computerized system resources. At each computerized system node or tier the data is manipulated and potentially influences the entire computerized system such that, in some exemplary embodiments of the present invention, a failure at a single point causes a failure of the request of the user 170. Permanent failures which cause the computerized system to stop functioning and specific circumstance failures that are a result of a specific chain of events may create substantial delay and damage.
In another exemplary embodiment of the present invention, the collecting module 202 can use existing tools or use the computerized system tools 100 of
A person skilled in the art will appreciate that each one or any combination of the monitoring tools can be used for collecting events or resource data. The apparatus 200 of the present invention further comprises a transforming module 220 for transforming the information stored on the database or repository 210 to a predetermined meaningful mathematical representation form to be analyzed by analyzing module 230 for further processing. Transforming module 220 transforms the computerized system events to events based time series. The transforming module 220, in some exemplary embodiments of the present invention, stores the predetermined representation form at the database 210.
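The transformation of time-tagged events into an event based time series can be sketched as follows. This is only one possible representation, assumed for illustration: each event is a tuple of event type, start time and end time in seconds, and the value stored for each interval is the number of seconds during which events of that type were active. The name `to_time_series` and the default 15 second interval (the non limiting interval mentioned elsewhere in this description) are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed event representation: (event_type, start_seconds, end_seconds).
Event = Tuple[str, float, float]


def to_time_series(
    events: Iterable[Event],
    window_start: float,
    window_end: float,
    interval: float = 15.0,
) -> Dict[str, List[float]]:
    """Return, per event type, the active seconds in each fixed-length interval."""
    n_bins = math.ceil((window_end - window_start) / interval)
    series: Dict[str, List[float]] = defaultdict(lambda: [0.0] * n_bins)
    for event_type, start, end in events:
        for i in range(n_bins):
            bin_start = window_start + i * interval
            bin_end = bin_start + interval
            # Length of the overlap between the event and this interval.
            overlap = min(end, bin_end) - max(start, bin_start)
            if overlap > 0:
                series[event_type][i] += overlap
    return dict(series)
```

For example, an event of type "A" running from second 0 to second 20 contributes 15 seconds to the first interval and 5 seconds to the second. Events with zero duration contribute nothing in this sketch; counting event starts per interval would be an equally plausible representation.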
A computerized system serving multiple users may generate several events of the same event type; therefore, in some exemplary embodiments of the present invention, the events collected or data extracted by collecting module 202 are classified by event types. An event type, in accordance with the preferred embodiment of the present invention, is a computer routine or a subroutine or a function or a set of one or more computer code lines that require input data and have an output. A different input or output of the same subroutine or function or set of one or more computer code lines is referred to as a different event. However, events that differ in their input or output but are a result of the same computer routine or computer function are attributed to the same event type. Therefore, one event can have a longer response time than another, and yet they are of the same event type. A person skilled in the art will appreciate the advantage of determining which resource of the computerized system is used by which event type, rather than making that determination for each individual event. In non limiting examples of the present invention, an event type can include an SQL command or an HTTP URL request or a SAP transaction code. An SQL command can be “select * from table EMPLOYEE where ID=? and NAME=?”. An SQL command can also be “select ID, NAME, DATA from Employee where COMPANY=?” or “update table EMPLOYEE set NAME=? where ID=?”. An HTTP request can be any of the following:
A SAP transaction event can be ZGM_GRANT_STATUS;GPV1TRUC914. A SAP transaction event can also be ME51N; PRGV156 GH or STA05; GVPX.
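The attribution of events that differ only in their input to a single event type can be sketched as follows for SQL commands. The approach shown, replacing quoted strings and numeric literals with the "?" placeholder used in the examples above, is an assumption made for illustration; the description does not state how event types are derived. The names `sql_event_type` and `count_by_event_type` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

_STRING = re.compile(r"'(?:[^']|'')*'")    # single-quoted SQL string literal
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")  # standalone numeric literal
_SPACE = re.compile(r"\s+")


def sql_event_type(statement: str) -> str:
    """Map a concrete SQL statement to its event type by masking literals."""
    text = _STRING.sub("?", statement)
    text = _NUMBER.sub("?", text)
    return _SPACE.sub(" ", text).strip()


def count_by_event_type(statements: Iterable[str]) -> Counter:
    """Count how many collected events fall under each event type."""
    return Counter(sql_event_type(s) for s in statements)
```

With this sketch, two statements that query the EMPLOYEE table with different ID and NAME values are counted under the same event type, so resource consumption can be attributed per event type rather than per individual event.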
In the context of the present invention, events collected or data extracted can be also described as information collected or extracted. Sniffing programs or port mirror programs can be used, in some exemplary embodiments of the present invention, for collecting network traffic, from which events or resource data can be extracted.
The apparatus 200 of the present invention further comprises a database or repository module 210 for storing the information collected by extracting or collecting module 202 or by a transforming module 220 or by an analyzing module 230 or by a data visualization module 240 or a combination of the said modules. The Database module 210 stores the information about the event generated by the element of the computerized system in a database or a repository. Any type of database device can be used as the database module 210 of the present invention. In a non limiting example, the database module 210 of the present invention is an SQL generated database, produced and manufactured by the Microsoft Corp, Washington, USA. The apparatus 200 for diagnosis of the computerized system of the present invention further comprises a transforming module for transforming the at least one event or at least one data element to an at least one event based time series as further described at
The events generated by the computerized system are monitored constantly and time tagged according to their appearance (start time) and termination (end time), while the resources are monitored at predetermined time intervals. A non limiting example of such a time interval is 15 seconds. In step 272 the information collected by apparatus 200 is stored at the database module 210. In some alternative embodiments of the present invention, the step of storing the information at the database module will occur after the data is transformed, analyzed or visualized as is described below. Next, in step 274 the information collected or extracted is transformed to a predetermined form to be preferably analyzed by analyzing module 230 of
In step 280 the apparatus 200 of
As shown in
At 12:22:15 (360) (12 hours, 22 minutes, 15 seconds) the CPU utilization 342 was 76 percent. The reading from DISK 1 (344) was 22 bytes per second. The writing to DISK 2 (350) was 89 bytes per second and the network transported bytes 346 were 76 per second. Next, after 15 seconds, at 12:22:30 (362), the CPU 342 utilization was 21 percent. The reading from DISK 1 (344) was 54 bytes per second. The writing to DISK 2 (350) was 25 bytes per second and the network transported bytes per second (346) were 88.
A person skilled in the art will appreciate the different resource attributes that can be measured for determining the utilization of the said different resources. A non limiting example of different resources is a logical disk; a physical disk; a processor; a computerized system or subsystem and the like. A non limiting example of the different resource attributes is any one or combination of the following: time; consumed time; speed; network speed; storage space; available space; space; free space; bit rate or byte rate; read or write queue length; average queue length; temporary queue length; read or write time; transfer time; idle time; split i/o; packets; packets received; packets sent; packets per sec; bandwidth; received bytes; page faults; available bytes; committed bytes; commit limit; write copies; transition faults; cache faults; demand zero faults; pages input; page reads; pages output; pool paged; pool non-paged; page writes; free system page table entries; cache; cache peak; pool paged resident; system code total; system resident code; system total resident; system total driver; packets received; packets sent; packets error; packets unknown; system driver; system resident driver; system resident cache; committed in use; processor time; user time; interrupt; threads; processes; system up time; alignment fixups; exception dispatches; floating emulations; registry quota in use; file read operations; file write operations; file control operations; file read bytes; file write bytes; file control bytes; context switches; system calls; file data operations; system up time; processor queue length; memory page faults; page file sys usage; page file sys peak; and the like. One or more of the said attributes can be measured per second; bytes per second; seconds; bytes; bytes length; queue length; packets and the like.
Persons skilled in the art will appreciate that any other now available or later used or developed resource attributes and measurements are contemplated by the present invention.
Such a model of event to resource relation is essential for automatic problem and root-cause detection. A person skilled in the art will appreciate the management and operational advantages of determining a model in real time for any dynamic system.
In another exemplary embodiment of the present invention, exemplary display 500 represents a graph of the consumption of a specific resource by substantially all the events or event types of the computerized system (not shown) over one hour, from 12:12 to 13:12 on Oct. 10, 2004. Title 550 outlines the date and the one hour period for which graph 500 is plotted. The Y axis represents the specific resource utilization and each layer represents the consumption level of a single event or event type. In a non limiting example, the consumption of a first event type 542 over the diagnosed period is lower than the consumption of a second event type 532 and the consumption of a third event type 512. Peaks of the specific resource consumption are marked for all plotted event types in a legend box 502. A single resource bottleneck, which occurs when all events are consuming the same resource, can be easily diagnosed using the exemplary embodiment.
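The data behind such a layered graph can be sketched as follows, assuming that the consumption of the specific resource by each event type is already available as one value per interval. The function names (`stack_layers`, `peak_per_event_type`, `dominant_event_type`) and the dictionary representation are illustrative assumptions and do not describe the actual implementation of display 500.

```python
from typing import Dict, List, Tuple


def stack_layers(consumption: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Upper boundary of each layer when event types are stacked in order."""
    layers: Dict[str, List[float]] = {}
    running: List[float] = []
    for event_type, values in consumption.items():
        if not running:
            running = list(values)
        else:
            running = [r + v for r, v in zip(running, values)]
        layers[event_type] = running
    return layers


def peak_per_event_type(
    consumption: Dict[str, List[float]]
) -> Dict[str, Tuple[int, float]]:
    """Interval index and value of the peak consumption of each event type."""
    peaks: Dict[str, Tuple[int, float]] = {}
    for event_type, values in consumption.items():
        index = max(range(len(values)), key=values.__getitem__)
        peaks[event_type] = (index, values[index])
    return peaks


def dominant_event_type(consumption: Dict[str, List[float]], interval: int) -> str:
    """Event type consuming the most of the resource in a given interval."""
    return max(consumption, key=lambda t: consumption[t][interval])
```

The last layer returned by `stack_layers` is the total utilization of the resource in each interval, so intervals in which that total approaches the capacity of the resource are candidates for a single resource bottleneck, and `dominant_event_type` indicates which event type contributes most there.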
The person skilled in the art will appreciate that what has been shown is not limited to the description above. The person skilled in the art will appreciate that examples shown here above are in no way limiting and are shown to better and adequately describe the present invention. Those skilled in the art to which this invention pertains will appreciate the many modifications and other embodiments of the invention. It will be apparent that the present invention is not limited to the specific embodiments disclosed and those modifications and other embodiments are intended to be included within the scope of the invention. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation. Persons skilled in the art will appreciate that the present invention is not limited to what has been particularly shown and described hereinabove. Rather the scope of the present invention is defined only by the claims, which follow.