Embedded systems, i.e., computers that are components in larger systems and rely on their own microprocessor, are becoming commonplace. For example, embedded systems are being used in personal electronic items such as PDAs, inkjet printers, cell phones, and car radios. Embedded systems are also becoming critical components of many industrial devices such as test and measurement systems, including the AGILENT TECHNOLOGIES J6802A and J6805A Distributed Network Analyzers.
To meet this growing demand, operating system providers, such as MICROSOFT, provide embedded versions of their normal operating systems. One of the more recent offerings from MICROSOFT is WINDOWS XP EMBEDDED (referred to herein as XPE). Embedded operating systems, such as XPE, provide functionality in keeping with the nature of embedded systems, including mechanisms (such as write filters) for protecting critical data, such as the operating system or files, from being corrupted.
Embedded systems, much like personal computer systems, generally store data in memory and/or mass storage. Mass storage may comprise, for example, a variety of persistent components, including removable and non-removable storage drives such as hard drives and compact flash media. Memory is largely comprised of non-persistent components, such as RAM. The data stored in memory is subject to corruption due to power surges, hard power downs, viruses, and so on. Even though embedded systems have functionality, such as write filters, that seeks to prevent corruption, such functionality is not always effective.
Corruption can lead to intermittent failures that may compromise the operation of the system. In test and measurement systems, corruption can lead to erroneous test results, incorrect diagnosis, and unnecessary repairs and troubleshooting. Typically, the sooner corruption is detected, the less damage it is likely to cause. Many times corrupted data may be cleared from non-persistent components by simply restarting the affected program or by rebooting. Accordingly, it is desirable that the various programs running on an embedded system be monitored to ensure that remedial action can be taken as soon as possible.
In many non-embedded systems, a video display, e.g. a monitor, is provided to facilitate monitoring the operation of the system. By watching the monitor it is possible to detect or predict some errors that occur due to corruption. Further, the operating system can be configured to create a display on the monitor warning of error conditions caused by corruption or other factors.
It has become popular to use embedded systems without any form of display, monitor or otherwise. Such embedded systems are often referred to as “headless” systems. Interaction with headless systems, if any, is usually via a communication channel, such as the Internet. An example of a headless system is a distributed test device co-located with a network component or other convenient access point. Distributed test devices monitor the component or access point and report the results of the monitoring using the network being monitored or some other communication channel.
Human involvement with headless test systems is optimally limited to installation and activation of the system. In general, it is the goal of most embedded systems to be totally autonomous. It is thus desirable that an embedded system not only be reliable, but also self-monitoring. To achieve the goal of being both reliable and self-monitoring, embedded systems, headless or otherwise, should be programmed for the detection of conditions, such as corruption, that can cause the system to operate in a way other than intended. One known method is the use of watchdogs, which track the execution of processes and execute some form of action if the watched process should fail. Failure modes include process crashes due to faulty code execution; hanging (failing to progress in a useful way); and deadlocking (stopping execution due to resources being unavailable).
Most known embedded system watchdogs are hardware based and track a single process. The action on failure is typically a reboot. Hardware based watchdogs typically operate by monitoring a specified memory or register location. The location is assigned to the process, which is modified to update the location on a regular basis, termed “checking-in.” If the location is not updated on schedule, the watchdog initiates a reboot.
Unfortunately, hardware based watchdogs only track a single process and, due to their nature, are not available on all computers. Further, the action upon failure is a simple reboot. Accordingly, the present inventors have recognized a need for apparatus and methods for tracking multiple processes and providing flexible actions upon failure.
An understanding of the present invention can be gained from the following detailed description of the invention, taken in conjunction with the accompanying drawings of which:
Reference will now be made in detail to the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. In addition to the described apparatus, the detailed description which follows presents methods that may be embodied by routines and symbolic representations of operations of data bits within a computer readable medium, associated processors, embedded systems, general purpose personal computers and the like. The methods presented herein are sequences of steps or actions, often executed by a processor or dedicated circuits, leading to a desired result, and as such, encompass such terms of art as “software,” “routines,” “computer programs,” “programs,” “objects,” “functions,” “subroutines,” and “procedures.” These descriptions and representations are the means used by those skilled in the art to effectively convey the substance of their work to others skilled in the art.
The methods of the present invention will be described with respect to implementation on a headless embedded computer system using an embedded operating system. Those of ordinary skill in the art will recognize that the apparatus and methods recited herein may also be implemented on embedded systems with monitors and even on general purpose computing devices, with or without monitors. While the present invention will be described as being implemented using PC based devices operating with the MICROSOFT WINDOWS XP EMBEDDED operating system (hereinafter XPE), the apparatus and methods presented herein are not inherently related to any particular device or operating system. Rather, various devices and operating systems may be used in accordance with the teachings herein. Machines that may perform the functions of the present invention include those manufactured by such companies as AGILENT TECHNOLOGIES and HEWLETT PACKARD as well as other manufacturers of embedded systems and general computing devices.
An indicator 124, connected to the I/O subsystem 122, provides an indication that the embedded system is operational in addition to being ON. Preferably, but not necessarily, this indication takes the form of a two color light emitting diode wherein one color is illuminated when certain conditions, generally reflective of an operational system, are present and a second color is illuminated when the conditions are not present, generally indicating a non-operational system. A suitable indicator and control circuitry therefor is described in co-pending U.S. patent application Ser. No. 10/726,769 entitled APPARATUS AND METHOD FOR INDICATING SYSTEM STATUS IN AN EMBEDDED SYSTEM. The '769 application is assigned to the assignee of the present application and incorporated herein by reference.
It is to be noted that the block diagram shown in
The present invention generally comprises a software-based watchdog comprising two parts: a registration procedure and a watchdog procedure. The registration procedure facilitates registering programs to be watched (the “watched programs,” “registered programs,” or “identified programs”) and the performing of confirmation actions by the watched programs. The watchdog procedure checks for the periodic completion of the confirmation actions by the watched programs and, upon the failure to complete a confirmation action, executes defined remedial actions. In accordance with at least one preferred embodiment implemented using XPE, the registration procedure is provided in a dynamic link library (DLL) file while the watchdog procedure is embodied by a service.
In accordance with perhaps the preferred embodiment, the registration procedure provides for at least three functions: register; check-in; and unregister. The register function permits a program to register with the watchdog and pass various parameters including how often the watchdog should expect the program to check in (the delta time) and what action or actions are to be taken if the program fails to check in (the remedial action(s)). The check-in function simply passes a time stamp to a common memory location to serve as the confirmation action. The unregister function removes the program from the watchdog's list of programs being watched.
The watchdog procedure periodically walks through the list of registered programs and, for those programs requiring service (as defined by the delta time), checks the memory location for the timestamp recorded by the check-in function. When the timestamp plus the delta time is less than the current time, it is deemed that a failure has occurred. The watchdog procedure logs the failure and takes a specified remedial action.
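The failure test reduces to a single comparison. The sketch below, in Python, is illustrative only; the function name `has_failed` is an assumption, and treating the boundary case (timestamp plus delta exactly equal to the current time) as a pass is also an assumption, since the description covers only the strictly-less-than case.

```python
def has_failed(timestamp, delta, current_time):
    """Return True when timestamp + delta is earlier than the current time."""
    return timestamp + delta < current_time


# Last check-in at t = 100 s with a delta time of 30 s:
print(has_failed(100.0, 30.0, 125.0))  # False: 130 is not less than 125
print(has_failed(100.0, 30.0, 131.0))  # True: 130 is less than 131
```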
In one embodiment, remedial actions comprise executable files that are called by the watchdog procedure. In perhaps the preferred embodiment, the register function passes the watchdog procedure an array designating a series of executable files. It is also possible to simply pass a pointer to the array. The watchdog procedure maintains a counter, which can be implemented as part of the pointer, which is incremented each time a registered program fails a check. Upon the detection of a failure, an executable file from the array, selected using the counter, is executed. For example, upon the first failure the first executable file in the array is executed; upon the second failure the second executable file in the array is executed; and so on. This allows different remedial actions to be taken each sequential time the registered program fails. For example: the first failure could result in the program being restarted; the second failure could result in a system restart; and the third failure could result in a system shutdown.
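Selecting a remedial action from the array by failure count might look like the following Python sketch. The function name and the executable names are hypothetical. The description does not say what happens once the counter runs past the end of the array; repeating the last action is an assumption made here so the sketch is total.

```python
def select_action(actions, failure_count):
    """Pick the action for the Nth failure, counting from 1.

    Assumption: once the list is exhausted, the last action is repeated.
    """
    if not actions or failure_count < 1:
        return None
    index = min(failure_count, len(actions)) - 1
    return actions[index]


# Hypothetical executables, ordered from mildest to most severe.
actions = ["restart_program.exe", "restart_system.exe", "shutdown_system.exe"]

print(select_action(actions, 1))  # restart_program.exe
print(select_action(actions, 2))  # restart_system.exe
print(select_action(actions, 5))  # shutdown_system.exe (clamped to last entry)
```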
The method starts in step 200 when a program (application, system or otherwise) seeking service calls the routines of the present method. In accordance with at least one embodiment, routines embodying the present method are contained in a common dynamic link library which offers three services: Register; Check-in; and Unregister. In step 202 a determination is made as to which of the three services is being requested.
When registration service is requested, the method proceeds to step 204 wherein the program requesting registration is identified. Subsequently, in step 206, a delta time is identified. The delta time is the maximum time allowed for the program to check in. Next, in step 208, a remedial action list is identified and a pointer to the list, termed the error pointer EPn, is initialized to zero.
In at least the present embodiment, the remedial action list is an array containing identifiers of executables, one of which is to be executed each time the program fails to check in within the delta time. Each time a remedial action is needed, the error pointer is increased by one and the next executable in the remedial action list is executed. It is envisioned that the executables will instruct the system to take increasingly severe measures. For example, the first action could comprise shutting down and restarting the program. The second and third actions could simply be restarting the entire system, while the fourth action could be shutting the system down for maintenance.
In step 210, a registration entry is generated. In perhaps the preferred embodiment, the registration entry is a key in the XPE registry. The key could contain, for example, the program name, the delta time, the location of the remedial action list and a pointer into the list. In general, the information required to create the registration entry can be passed to the routine by the calling program (thereby constituting the steps of identifying). It may also be beneficial at this time to add a time stamp to the registration entry as a first check-in service. Alternatively, a request for check-in service can be issued immediately upon completion of the register service. The registration service thereafter ends in step 218 and the program initiating the registration service is now a watched program.
If, back in step 202, check-in service was requested by a watched program, the method proceeds to step 212 and the current system time is retrieved. Subsequently, in step 214, a time stamp is written to a location associated with the registration entry. Each time the check-in service is called, the time stamp is overwritten with the then current time. The watched program should be programmed to repeatedly call the check-in service within the delta time. The check-in service thereafter ends in step 218.
If, back in step 202, the remove registration service was requested, the method proceeds to step 216, and the registration entry is removed. In the present embodiment, this is accomplished by simply deleting the registry key in XPE. The remove registration service thereafter ends in step 218 and the program requesting the remove registration service is no longer a watched program.
The method starts in step 300 upon being invoked, preferably as part of a startup routine. In step 302, a list of programs currently registered, for example using the registration procedure described with respect to
In step 304, the time stamp (TS), delta, and error pointer (EPn) of the next entry (the first entry if this is the first pass) are retrieved. Next, in step 306, the current time (CT) is retrieved. In step 308, the sum of the time stamp associated with the entry and the delta associated with the entry is compared to the current time. If the sum is greater than the current time, the program is deemed to be operating correctly and the method goes to step 310 to check if there are more watched programs to check. If unchecked watched programs remain, the method returns to step 304 and the time stamp (TS), delta, and error pointer (EPn) of the next entry are retrieved.
If, in step 308, the sum of the time stamp and the delta is less than the current time, the watched program has failed to timely request a check-in and the method proceeds to step 312. In step 312, the error pointer for the subject watched program (EPn) is incremented by 1. Next, in step 314, the remedial action pointed to by the error pointer is undertaken. Optionally, a log of the failure and the remedial action can be made. Upon completion of the remedial action, the method goes to step 310 to check if there are more watched programs to check. If any watched programs remain to be checked, the method returns to step 304 and the time stamp (TS), delta, and error pointer (EPn) of the next entry are retrieved.
Once all watched programs have been checked, the method proceeds to step 316, where the method waits for a prescribed interval prior to returning to step 302 to recheck the watched programs.
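The watchdog flow of steps 302 through 316 can be summarized in a short Python sketch. It is not the XPE service itself: the entries are held in a plain dictionary, `run_action` is a hypothetical callback standing in for launching an executable, and all field names are assumptions. The sketch also assumes that an error pointer of 1 selects the first remedial action and that the last action repeats once the list is exhausted.

```python
import time


def watchdog_pass(entries, run_action, now=None):
    """Check every watched program once (steps 302 to 314)."""
    failed = []
    for name, entry in entries.items():                    # steps 304 and 310
        current = time.time() if now is None else now      # step 306
        if entry["timestamp"] + entry["delta"] < current:  # step 308
            entry["error_pointer"] += 1                    # step 312
            actions = entry["actions"]
            if actions:
                # Assumption: pointer value 1 maps to the first action and
                # the last action is reused once the list is exhausted.
                index = min(entry["error_pointer"], len(actions)) - 1
                run_action(name, actions[index])           # step 314
            failed.append(name)
    return failed


def watchdog_loop(entries, run_action, interval_seconds, passes):
    """Repeat the pass, waiting between passes (step 316)."""
    for _ in range(passes):
        watchdog_pass(entries, run_action)
        time.sleep(interval_seconds)
```

A real service would loop indefinitely; the `passes` argument bounds the loop here only so the sketch terminates. The `failed` list returned by each pass could feed the optional failure log.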
It will be appreciated by those skilled in the art that changes may be made to the above described embodiment without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents. For example, the error pointer (EPn) can be reset after the expiration of a certain period of time, for example once a day. Alternatively, once for every X (e.g., 100) successful and uninterrupted check-ins, the error pointer could be reduced. Optionally, any reduction or resetting of the counter can be logged.
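The check-in based reduction of the error pointer could be sketched as follows, in Python. The `streak` field, the function names, and the choice to reduce the pointer by exactly one per X check-ins are assumptions made for illustration; the description leaves these details open.

```python
def record_check_in(entry, successes_per_step=100):
    """Count a successful check-in; reduce the error pointer every X in a row.

    Returns True when the pointer was reduced, so the caller can log it.
    """
    entry["streak"] = entry.get("streak", 0) + 1
    if entry["streak"] >= successes_per_step:
        entry["streak"] = 0
        if entry["error_pointer"] > 0:
            entry["error_pointer"] -= 1
            return True
    return False


def record_failure(entry):
    """A missed check-in interrupts the streak and advances the pointer."""
    entry["streak"] = 0
    entry["error_pointer"] += 1
```

Resetting the streak on every failure is what makes the check-ins "uninterrupted": a program that fails intermittently never accumulates enough consecutive successes to step its remedial action back down.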
By way of further example, Tables 1 through 4 present source code, compatible with XPE, implementing certain features of the present invention. In this implementation, the only remedial action is a system restart. Further, this implementation uses an LED indicator on the embedded system to communicate system status to an operator. One implementation of an LED indicator is discussed in co-pending U.S. patent application Ser. No. 10/726,769, incorporated herein by reference.
Although a few embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the claims and their equivalents.