The invention relates to control systems and more particularly to the monitoring and supervision of the behavior of power delivery control systems.
Power delivery control systems include supervisory control and data acquisition (SCADA) systems, distribution management systems (DMS), energy management systems (EMS) and generation management systems (GMS). Such systems are required to have high availability and reliability because of their importance to the public. Since the first development of remote control systems for power plants in the 1920s and 1930s (such as by ASEA AB and the Brown Boveri Company), power control systems have evolved into powerful and sophisticated information technology (IT) systems, which enable utilities to efficiently manage their grids and reliably deliver power to end users. In order to maintain the reliability and availability of these sophisticated control systems, it is important to detect and identify failures in the control systems and to analyze error conditions after they have occurred in order to prevent failures in the future.
Software tools are available for monitoring and supervising the behavior of computers. These tools, however, are often provided by computer vendors and/or operating system vendors and are usually limited to operating system (O/S) tools. Moreover, these and other available conventional tools have the following drawbacks: (1.) the tools cannot collect and store data at the frequency needed in control systems (i.e., sub-second in some cases); (2.) the data available from the tools is not granular enough (in time intervals and in data values) for troubleshooting; (3.) the tools have large CPU/memory/disk/network footprints, which can themselves cause problems for the system; (4.) tools from different vendors are not able to correlate data because they collect data at different cycle times; and (5.) data is presented in different formats or displays, which makes the viewing and analysis of the data tedious, error-prone and slow.
Specific examples of conventional O/S tools include O/S standard counter tools (e.g., vmstat, iostat) and O/S customized kernel tools (e.g., glance, collect). The data produced by the O/S standard counter tools is not historical in nature and is not tied to any particular process action. Data provided by the O/S customized kernel tools is not granular enough, is not standardized and is not tailored to the application running on the system. Thus, data from different types of resources used by the application cannot be gathered and stored in one collection cycle.
Data provided by application tools (e.g., power system applications, database applications, etc.) is in general very granular and is not designed for historical usage or run-time monitoring. In addition, these tools were developed to diagnose specific and isolated problem areas and are typically not adapted for use in an environment utilizing a combination of systems, servers, applications and communication protocols, such as a modern power delivery control system. Individual tools running in combination utilize a significant amount of hardware and software resources, which increases the overall load on the system; this runs counter to a reason for using the tools in the first place.
Based on the foregoing, there is a need for a monitoring tool adapted for use in control systems. The present invention is directed to such a monitoring tool.
In accordance with the present invention, a computer-implemented method is provided for monitoring the operation of a control system adapted to monitor and control a power delivery operation. The control system has a plurality of computers. Each of the computers has a central processing unit (CPU) with an operating system running thereon, and each of the operating systems has one or more kernel interfaces. In accordance with the method, a determination is made of what processes are running on each computer. Threshold criteria for data collection are received. Initial data is collected based on the processes running on each computer. The initial data is time stamped. A determination is made whether the initial data meets any of the threshold criteria. If the initial data meets any of the threshold criteria, additional data is collected based on the threshold criteria that have been met. The additional data is time stamped. The initial data and the additional data are correlated using the time stamps and/or characteristics of the running processes. The correlated, collected data is then displayed. The kernel interfaces are used to collect the initial data and the additional data.
Also provided in accordance with the present invention is a control system adapted to monitor and control a power delivery operation. The control system includes field equipment, one or more remote terminal units (RTUs) associated with the field equipment and a plurality of computers including a data acquisition computer adapted to receive data from the one or more RTUs and an application computer operable to execute an application for controlling one or more pieces of the field equipment. Each of the computers includes a central processing unit (CPU) and computer readable media containing operating system software with one or more kernel interfaces and a monitoring tool, which when executed by the CPU performs a method of monitoring the operation of the computer. In accordance with the method, a determination is made of what processes are running on the computer. Threshold criteria for data collection are received. Initial data is collected based on the processes running on the computer. The initial data is time stamped. A determination is made whether the initial data meets any of the threshold criteria. If the initial data meets any of the threshold criteria, additional data is collected based on the threshold criteria that have been met. The additional data is time stamped. The initial data and the additional data are correlated using the time stamps and/or characteristics of the running processes. The correlated, collected data is then displayed. The kernel interfaces are used to collect the initial data and the additional data.
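The collection sequence described above (determine running processes, receive threshold criteria, collect and time stamp initial data, test the thresholds, collect additional data for criteria that were met) can be sketched as follows. This is a minimal illustration only; the function names, the metric name cpu_pct and the dictionary-based sample layout are assumptions made for this sketch and are not part of the disclosed system.

```python
import time

def collect_initial(processes):
    """Illustrative: gather one coarse, time-stamped sample per running process."""
    return [{"pid": p, "cpu_pct": 0.0, "ts": time.time()} for p in processes]

def monitor_cycle(processes, thresholds):
    """One collection cycle: sample, time stamp, test threshold criteria,
    and collect additional (detailed) data only for criteria that were met."""
    initial = collect_initial(processes)
    detail = []
    for sample in initial:
        for name, limit in thresholds.items():
            if sample.get(name, 0.0) > limit:
                # Threshold met: collect additional, finer-grained data,
                # also time stamped, within the same cycle.
                detail.append({"pid": sample["pid"], "metric": name,
                               "ts": time.time()})
    return initial, detail
```

Because both the initial and the additional samples carry time stamps, they can later be correlated as the method describes.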
The features, aspects, and advantages of the present invention will become better understood with regard to the following description, appended claims, and accompanying drawings where:
It should be noted that in the detailed description that follows, identical components have the same reference numerals, regardless of whether they are shown in different embodiments of the present invention. It should also be noted that in order to clearly and concisely disclose the present invention, the drawings may not necessarily be to scale and certain features of the invention may be shown in somewhat schematic form.
Referring now to
Each operator console 12 may be a personal computer (PC) with a central processing unit (CPU) and a monitor for providing visual displays to an operator. A graphical user interface (GUI) runs on each operator console 12 and is operable to display a plurality of different views or windows on the monitor.
Each of the computers in the SCADA system 10 has an operating system. As is well known, an operating system is system software responsible for the control and management of computer resources. A typical operating system enables communication between application software and the hardware of a computer. The operating system allows applications to access the hardware and basic system operations of a computer, such as disk access, memory management, task scheduling, and user interfacing. Additionally, an operating system is also responsible for providing network connectivity. The operating systems of the computers in the SCADA system 10 may be Windows® operating systems available from Microsoft Corporation or different types of Unix operating systems (Tru64, HP-UX, Linux) available from vendors such as Hewlett Packard, Sun Microsystems, IBM, RedHat, SUSE, etc.
Each operating system of each computer in the SCADA system 10 has kernel interfaces. A kernel interface is a low-level communication bridge between a process (application program) and the kernel of the operating system. A kernel interface for a process typically includes entry points and calling conventions. Entry points are where the execution of a process enters the kernel, while calling conventions control how function arguments are passed to the kernel and how return values are retrieved.
A kernel interface of an operating system may include functions to start and stop threads of execution, to synchronize threads of execution, to read and write data from files, to enlist in regions of shared memory, to draw on a screen, to communicate with a network, or to read a system clock. Kernel interfaces can range from dozens of entry points to hundreds and even thousands of entry points.
A kernel interface is one component of the operating system's application programming interface (API) presented to programmers. An API typically defines the interface between an application programmer's source code and operating system libraries so that the same source code will compile on any system supporting the API. Many conventional operating systems encapsulate the kernel interface within a higher-level API because the kernel interface varies from version to version of the kernel.
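By way of illustration, on a Linux host one such kernel-provided interface is the /proc pseudo-filesystem, from which per-process data can be read without kernel symbol mapping. The sketch below is a generic, Linux-specific example and is not the monitoring tool itself; the choice of fields is arbitrary.

```python
import os

def read_proc_status(pid):
    """Read selected fields from /proc/<pid>/status, a Linux kernel
    interface exposing per-process state, memory use and thread count."""
    fields = {}
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in ("Name", "State", "VmRSS", "Threads"):
                fields[key] = value.strip()
    return fields
```

A program can, for example, call read_proc_status(os.getpid()) to inspect its own state.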
The operator console(s) and the historian computer 14 are connected to a first network 24, while the applications server computers 18 and the data acquisition server computers 20 are connected to a second network 26. The first and second networks 24, 26 may each comprise a pair of redundant Ethernet cables over which information is communicated using the TCP/IP protocol.
The RTUs 22 are microprocessor-based devices that are associated with field equipment in an electrical distribution system, such as re-closers, relays, switches, interrupters, capacitor banks, etc. The RTUs 22 are essentially communication interfaces and may be integrated directly into the field equipment, or, more often, are connected to the field equipment as external communication units. The RTUs 22 are periodically interrogated or “polled” by the data acquisition server computers 20, and the RTUs 22 respond with data gathered from their associated field equipment or systems. The interrogation may be specific (meaning that the data acquisition server computers 20 contact a specific RTU 22 using some form of unique identifier), or global (in which case the data acquisition server computers 20 send a single poll and all of the RTUs 22 that receive the poll respond sequentially according to some predetermined order of response). The RTUs 22 may communicate with the data acquisition server computers 20 using the distributed network protocol (DNP), which is an open protocol that is used for control and communication in SCADA systems. The DNP handles three types of data: binary, analog and counter. Counter data represents count values, such as kilowatt hours, that increase until they reach a maximum and then roll over to zero and start counting again. The DNP is not a general purpose protocol for transmitting hypertext, multimedia or huge files. DNP 3.0 is the most current version of DNP.
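The specific and global interrogation modes described above can be illustrated with the following sketch. The data model is a placeholder; it does not implement the DNP 3.0 wire protocol, and the class and function names are invented for illustration only.

```python
class FakeRTU:
    """Stand-in RTU returning a fixed analog value for illustration."""
    def __init__(self, value):
        self.value = value

    def read(self):
        return self.value

def poll_specific(rtus, rtu_id):
    """Specific interrogation: address one RTU by its unique identifier
    and return its reading."""
    return [rtus[rtu_id].read()]

def poll_global(rtus):
    """Global interrogation: a single poll; every RTU responds in a
    predetermined order (here, ascending identifier order)."""
    return [rtus[i].read() for i in sorted(rtus)]
```

In the specific mode only the addressed RTU answers; in the global mode every RTU answers, in a fixed order, to one poll.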
The application server computers 18 may execute one or more applications for controlling and/or optimizing the performance of the power delivery operation using field information, such as may be obtained from the data acquisition server computers 20.
Copies of a monitoring (software) program or tool 50 run on one or more of the computers of the SCADA system 10, such as on the operator consoles 12 and/or the applications server computers 18 and/or the data acquisition server computers 20. A computer upon which a copy of the monitoring software program runs shall hereinafter be referred to as a “host computer”. In one embodiment of the present invention, a copy of the monitoring tool 50 is installed on each computer in the SCADA system 10. In this embodiment, the copies of the monitoring tool 50 communicate with each other and one of the copies may act as a “master” that can control the other copies. Each copy of the monitoring tool 50 may be executed automatically upon start-up of the host computer, upon command of an operator, or upon command of another computer program. In one embodiment of the present invention, the monitoring tool 50 is an autonomous independent process that is started automatically at operating system reboot on all computers that are part of the SCADA system 10.
On each host computer, the monitoring tool 50 is stored in memory and executed by a central processing unit (CPU) of the host computer. The monitoring tool 50 includes a graphical user interface (GUI) that is displayed on a monitor of the host computer. In order to present data for pattern recognition, the GUI may display all collected data at the same time interval with graphs of data where appropriate. In addition, the GUI may be used to: (i) filter on a particular data type and time on run-time or on historical data, (ii) generate and display graphs and detailed data in a side-by-side manner, (iii) generate and display color coded high water marks for data, and (iv) generate and display landscapes of processes in a manner that allows for easy identification of trouble spots.
The monitoring tool 50 has the following run time properties: (1.) the monitoring tool 50 executes constantly as a background process; (2.) the monitoring tool 50 is natively compiled with a very small, compact data and code section; (3.) the monitoring tool 50 is capable of running on multiple operating platforms; (4.) the monitoring tool 50 uses vendor-supplied kernel interfaces for data sources; (5.) the monitoring tool 50 does not use kernel symbol mapping methods; (6.) the monitoring tool 50 runs under the context of a privileged user; (7.) the monitoring tool 50 favors a higher priority setting than most other applications; (8.) the monitoring tool 50 runs on an internal cyclic timer at a frequency of under one second; (9.) the monitoring tool 50 does not, at any time, consume more than five percent of the CPU processing utilization of its host computer; and (10.) the monitoring tool 50 utilizes less than one-tenth of one percent of the total memory of its host computer.
The monitoring tool 50 initially creates space in the memory of the host computer to store selected process/thread data for up to 10000 separate entries. Entries in the internally maintained thread/process table are indexed using a simple hash formula of:
HASH_INDEX=(THREAD_ID<<1) % 10000
where THREAD_ID is the operating system identifier of the thread whose entry is being indexed.
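A minimal sketch of this indexing scheme follows; collision handling and the table entries themselves are omitted, and only the table size and hash formula are taken from the description above.

```python
TABLE_SIZE = 10000  # up to 10000 separate process/thread entries

def hash_index(thread_id):
    """Map a thread identifier to a slot in the process/thread table:
    shift left by one, then take the remainder modulo the table size."""
    return (thread_id << 1) % TABLE_SIZE
```

For example, thread identifier 3 maps to slot 6, and identifier 5000 wraps around to slot 0.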
The monitoring tool 50 monitors and analyzes the process structure of the host computer. The process structure of the host computer is the various executable software programs (processes) that are installed and runnable on the host computer. Referring now to
Initially, an inventory is taken of vital machine configuration information, such as the number of disks/CPUs, memory size and network capacity. This configuration information may be included in a report generated by the monitoring tool 50. On the start of a collection cycle, standard metrics for CPU, memory, disk, process/thread, inter-process communication methods and network usage are collected, stored, collected again and then analyzed. When a configurable usage threshold has been crossed for any of the standardized metrics, or a combination of metric rules has been met, detailed collection processing is started within the same processing cycle. This incident information is stored in a separate time-stamped file. Once an incident has been reported, a back-off time algorithm is used to prevent incident files from appearing at a high frequency.
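The back-off behavior described above might be sketched as follows. The doubling policy, initial quiet period and cap are assumptions made for this sketch; the patent does not specify the particular back-off algorithm used.

```python
import time

class IncidentBackoff:
    """Suppress incident files that would otherwise appear at high
    frequency: after each reported incident the minimum quiet period
    doubles, up to a cap (policy assumed for illustration)."""
    def __init__(self, initial=1.0, cap=60.0):
        self.delay = initial
        self.cap = cap
        self.last_report = None

    def should_report(self, now=None):
        now = time.time() if now is None else now
        if self.last_report is not None and now - self.last_report < self.delay:
            return False  # still inside the quiet period: suppress
        if self.last_report is not None:
            self.delay = min(self.delay * 2, self.cap)  # lengthen quiet period
        self.last_report = now
        return True
```

A burst of incidents therefore produces one file immediately, then files at progressively longer intervals.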
The monitoring tool 50 collects data from a host computer using only kernel interfaces. The monitoring tool 50 may selectively collect data for a particular time interval from each process using the kernel interface to the process. The data that may be collected by the monitoring tool 50 from the processes of a host computer includes CPU, disk, memory, network and process statistics; utilization percentages (overall, process); process/thread state percentages; run queue averages; interrupts processed; context switches performed; and system calls made. The data that is collected by the monitoring tool 50 is time stamped using a system timestamp of the SCADA system 10 or external clocks.
The monitoring tool 50 permits the correlation of different data types using timestamps and/or process analysis. For example, the monitoring tool 50 may correlate data from those processes that execute at the same time, as determined from time stamps. The monitoring tool 50 may also correlate data from those processes that use the same file I/O. The monitoring tool 50 may also trace process dependencies, i.e., parent-child relationships, such as are shown in the tree of
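Correlation by time stamp can be sketched as a simple bucketing of samples into common collection intervals, so that metrics gathered from different data sources in the same interval can be viewed together. The window size and the sample layout here are illustrative assumptions.

```python
from collections import defaultdict

def correlate_by_time(samples, window=0.5):
    """Group time-stamped samples from different data sources into
    buckets of `window` seconds; samples landing in the same bucket
    are treated as having executed at the same time."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[int(s["ts"] / window)].append(s)
    # Return the groups in chronological order, each sorted internally.
    return [sorted(group, key=lambda s: s["ts"])
            for _, group in sorted(buckets.items())]
```

Correlation by shared file I/O or by parent-child relationship would use process characteristics instead of the time stamp as the grouping key.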
As set forth above, the monitoring tool 50 uses threshold settings to start and stop the collection and storage of detailed data. The threshold settings may be user-selectable through the GUI of the host computer. Collected detailed data may be stored in, and retrieved from, the historian server computer 16 by the historian computer 14. An example of the use of a threshold setting to start and stop detailed data collection may be system CPU utilization of 75%. When system CPU utilization is greater than 75%, the monitoring tool 50 may start to collect, timestamp and store all data available from the kernel interfaces. When the system CPU utilization thereafter falls below 65%, the monitoring tool 50 may stop collecting and storing data available from the kernel interfaces. Another example of a threshold that may be used to start data collection is file swapping. If file swapping becomes active, data on memory usage may be collected, time stamped and stored.
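The 75%/65% start/stop example above is a hysteresis band: detailed collection starts when utilization crosses the upper threshold and stops only after it falls below the lower one. A sketch with those values follows; the class name and interface are invented for illustration.

```python
class HysteresisTrigger:
    """Start detailed collection above the upper threshold and stop it
    only below the lower threshold, so a value oscillating around a
    single limit does not repeatedly toggle collection on and off."""
    def __init__(self, start_at=75.0, stop_at=65.0):
        self.start_at = start_at
        self.stop_at = stop_at
        self.collecting = False

    def update(self, utilization):
        if not self.collecting and utilization > self.start_at:
            self.collecting = True   # crossed upper threshold: start
        elif self.collecting and utilization < self.stop_at:
            self.collecting = False  # fell below lower threshold: stop
        return self.collecting
```

The gap between the two thresholds prevents a burst of start/stop transitions when utilization hovers near a single limit.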
More examples of detailed collection processes and the metrics (thresholds) that trigger them are as follows:
None of the detailed data collection processes impedes the host computer or the SCADA system 10. The detailed data collection processes do not affect any of the running threads/processes. The call frame reports are done with and without program symbols. Combinations of varying standard metrics thresholds are used to generate detailed collection reports.
An example of a detailed call frame analysis on a thread would be:
An example of the elements contained in the detailed file activity report on a per thread basis would be:
An example of a detailed socket activity report on a per thread basis would be:
Standard metrics are collected and stored into a data memory queue by the monitoring tool 50. A separate log archiving process reads the data memory queue and writes a compact, secure binary record to a disk (non-volatile memory) of the host computer. The data written to disk is appended with a time stamp (with millisecond granularity) and a record identifier. The data can be stored for up to one month in a revolving fashion.
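The time-stamped binary record and revolving retention described above might be sketched as follows. The record layout, the in-memory ring and the capacity are assumptions made for illustration; the actual tool writes to disk and retains up to one month of data.

```python
import struct
import time

RECORD = struct.Struct("<QI")  # millisecond timestamp, record identifier

class RevolvingLog:
    """Append fixed-size binary records to a ring that keeps at most
    `capacity` records, overwriting the oldest first (revolving store)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []

    def append(self, record_id, ts_ms=None):
        ts_ms = int(time.time() * 1000) if ts_ms is None else ts_ms
        self.records.append(RECORD.pack(ts_ms, record_id))
        if len(self.records) > self.capacity:
            self.records.pop(0)  # revolve: drop the oldest record

    def last(self):
        return RECORD.unpack(self.records[-1])
```

Fixed-size packed records keep both the disk footprint and the per-record write cost small and constant.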
Exceeding a threshold setting may also be used to increase or reduce the rate at which data is captured and stored.
A flow chart of a method performed by the monitoring tool 50 is shown in
As illustrated in
Based on the analysis of collected data, recovery actions may be invoked, such as (i) process restarts, (ii) forced shutdown of zombie processes, and/or (iii) forced shutdown of other processes to shed load from available relevant system parts.
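A dispatch from analyzed process state to one of the recovery actions listed above can be sketched as below; the state names and the rule set are invented for illustration and are not the patent's decision logic.

```python
def choose_recovery(process):
    """Map an analyzed process state to a recovery action: restart a
    crashed process, force-shut a zombie, or shed load from an
    overloaded system part (rules illustrative only)."""
    if process["state"] == "zombie":
        return "force_shutdown"
    if process["state"] == "crashed":
        return "restart"
    if process.get("load_pct", 0) > 90:
        return "shed_load"
    return "none"
```

The returned action name would then drive the actual recovery mechanism on the host computer.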
As can be appreciated from the foregoing description, the monitoring tool 50 of the present invention has a number of beneficial features, while meeting control system standards for reliability, availability and security. The monitoring tool 50 provides an optimized process for collecting, analyzing, correlating and historically storing operating system and application resource metrics on multiple operating system platforms. The process collects control system specific resource metrics at a high frequency while maintaining a minimal resource usage footprint. The resource metrics are collected at the same time interval and are tailored for support of highly available near real-time systems. Data of different types (e.g. application and system specific) is collected and correlated and may be used for post-disturbance analysis. The GUI of the monitoring tool 50 allows display of all collected data at the same time interval with graphs of data where appropriate.
The monitoring tool 50 permits thresholds of different resource metrics to be applied at the same time, which enables the monitoring tool 50 to identify processes/threads causing instability, which, in turn, enables more detailed process/thread information to be collected in the same collection cycle. All of this facilitates root cause analysis and recovery actions. Alerts can also be generated for early warning notification. A historical state of the process/threads is maintained in non-volatile memory of the host computer and/or in the historian server computer 16.
The speed of the monitoring tool 50 is particularly advantageous and enables the monitoring tool 50 to be used in important control and monitoring applications, such as in the SCADA system 10. For example (and with reference to the flow chart in
As will be appreciated by one of ordinary skill in the art, the present invention may be embodied as, or take the form of, the method and system previously described, as well as of a computer readable medium having computer-readable instructions stored thereon which, when executed by a processor, carry out the operations of the present invention as previously described and defined in the corresponding appended claims. The computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program instructions for use by or in connection with the instruction execution system, apparatus, or device and may, by way of example but without limitation, be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium, or other suitable medium upon which the program is printed. More specific examples (a non-exhaustive list) of the computer-readable medium would include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Computer program code or instructions for carrying out operations of the present invention may be written in any suitable programming language, provided it allows achieving the previously described technical results.
This application claims the benefit of U.S. provisional patent application No. 61/058,207 filed on Jun. 2, 2008, which is hereby incorporated by reference in its entirety.
Filing Document: PCT/US09/45963; Filing Date: 6/2/2009; Country: WO; Kind: 00; 371(c) Date: 3/15/2011
Number: 61/058,207; Date: Jun. 2008; Country: US