1. Technical Field
The present invention relates in general to the field of computers and similar technology systems, and in particular to mass storage devices associated with such technology systems. Still more particularly, the present invention relates to a method for controlling how often a Hard Disk Drive (HDD) is polled for Predictive Failure Analysis (PFA) alerts according to a temperature of a blade on which the HDD is mounted.
2. Description of the Related Art
Modern computers rely on a memory hierarchy for storing data and programs. The highest level (closest to execution units that are within a processor core used in the computer) of data is that found in registers, followed (in hierarchy level and decreasing retrieval speed) by cache memory, system memory, and mass storage memory. The most common type of mass storage memory is a Hard Disk Drive (HDD), which is made up of one or more platters that store data. Such data are stored by read/write heads that magnetize tiny areas on spinning (rotating) platters. These tiny areas represent ones and zeros that make up binary bits of data that are used by the computer. The read/write heads float above the platters on a thin cushion of air that is generated by the rotation of the platters. If the platters should stop spinning while the head is over a sensitive area (i.e., a writable area that is not designated as a head parking area on the platter), the head will “crash” into the platter, causing permanent mechanical damage to the head and possibly the platter.
To predict if and when an HDD will fail (due to a head crash, power failure, software failure, etc.), many HDD systems utilize Predictive Failure Analysis (PFA). PFA predicts failures of HDD systems either by the use of error logs (symptom driven) or by the use of Generalized Error Measurements (GEM) (measurement driven). The use of error logs and the use of GEM are similar in nature, but differ somewhat in application. Error logs are the output of data, non-data and motor start error recovery logs. An analysis of these logs by PFA is performed periodically during idle periods in which the HDD is not actively being written to or read from. These logs provide a history that, when compared to other histories that preceded an HDD failure, can be used to predict a failure of the current HDD operation. Similarly, GEM automatically performs a suite of self-diagnostic tests that measure any changes to the operation of the HDD. These tests use real-time measurements of head flying height (the distance between the read/write head and the platter), signal coherence, channel noise, etc. These measurements are then compiled to predict a failure of the HDD operation. This prediction is not history-based, but rather is performance-based.
Typically, the execution of PFA routines (including polling of logs, sensors, etc.) is on a static timetable. That is, PFA routines are executed at pre-determined time intervals (in the case of GEM), or only during idle periods (in the case of error log analysis). When setting the pre-determined time intervals (or deciding whether to wait for a next idle period or to poll error logs sooner), a balance must be struck between safety and performance. Thus, if PFA routines are run too often, then HDD performance is degraded, since the HDD is busy running PFA routines rather than reading/writing normal program data. Conversely, if the PFA routines are run too infrequently, then a fatal condition may occur before that fatal condition can be predicted and avoided by the PFA.
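The trade-off described above can be made concrete with a small calculation. The following Python sketch is illustrative only: the function names and the sample durations are assumptions and do not come from any PFA implementation. It shows that, for a fixed polling interval, the fraction of drive time spent on PFA falls as the interval grows, while the worst-case delay before a developing fault is observed rises with it.

```python
def pfa_overhead_fraction(routine_duration_s: float, interval_s: float) -> float:
    """Fraction of drive time consumed by PFA when one routine runs per interval.

    Hypothetical helper: assumes each PFA routine occupies the drive for
    routine_duration_s seconds and that one routine is started every
    interval_s seconds.
    """
    if routine_duration_s < 0 or interval_s <= 0:
        raise ValueError("duration must be >= 0 and interval must be > 0")
    # The drive cannot be more than 100% busy, so cap the ratio at 1.0.
    return min(1.0, routine_duration_s / interval_s)


def worst_case_detection_delay_s(interval_s: float) -> float:
    """Longest time a new fault can go unobserved: one full polling interval."""
    if interval_s <= 0:
        raise ValueError("interval must be > 0")
    return interval_s


if __name__ == "__main__":
    # Assumed, illustrative values: a 2-second routine at three intervals.
    for interval in (10.0, 60.0, 600.0):
        overhead = pfa_overhead_fraction(2.0, interval)
        delay = worst_case_detection_delay_s(interval)
        print(f"interval={interval:6.0f}s overhead={overhead:.3%} max delay={delay:.0f}s")
```

A static timetable fixes one point on this curve for all operating conditions, which is the limitation the thermal approach is intended to address.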
To address the problem described above, an improved method, apparatus and computer-readable medium for controlling a Predictive Failure Analysis (PFA) is presented.
For various reasons, the operation of a Hard Disk Drive (HDD) is very sensitive to high temperatures. High temperatures can cause, among other problems, head flying height to change to the point that either a read/write head is too far away from the platter to accurately record bits of data, or else the read/write head may be too close to the platter, resulting in a crash of the read/write head against the spinning platter. Likewise, high temperatures can cause erratic electronic signal production, both to control signals as well as data signals.
Thus, the present invention, by recognizing the danger that high temperatures pose to the HDD, adjusts PFA polling rates according to how hot an HDD or an associated blade environment is. In one embodiment, the invention includes the steps of setting a maximum temperature at which an electronic device (for example, a hard disk drive) can operate without damage; measuring a temperature of the electronic device; and increasing a frequency of Predictive Failure Analysis (PFA) operations as the temperature of the electronic device approaches the maximum temperature. The PFA operations include, but are not limited to, polling performance data that is being collected in real-time for the electronic device.
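As a minimal sketch of the three steps just listed (set a maximum temperature, measure the temperature, and increase PFA frequency as the measurement approaches the maximum), the following Python function maps a measured temperature to a polling interval. All names, the default intervals, and the linear ramp are assumptions chosen for illustration; the specification does not prescribe a particular formula.

```python
def polling_interval_s(
    temp_c: float,
    max_temp_c: float,
    ramp_start_c: float,
    base_interval_s: float = 300.0,
    min_interval_s: float = 10.0,
) -> float:
    """Return the time between PFA polls for a measured temperature.

    Hypothetical policy: poll at base_interval_s while the device is at or
    below ramp_start_c, shorten the interval linearly between ramp_start_c
    and max_temp_c, and poll at min_interval_s at or above max_temp_c.
    """
    if not ramp_start_c < max_temp_c:
        raise ValueError("ramp_start_c must be below max_temp_c")
    if not 0 < min_interval_s <= base_interval_s:
        raise ValueError("require 0 < min_interval_s <= base_interval_s")

    if temp_c <= ramp_start_c:
        return base_interval_s
    if temp_c >= max_temp_c:
        return min_interval_s

    # progress runs from 0.0 at the ramp start to 1.0 at the maximum.
    progress = (temp_c - ramp_start_c) / (max_temp_c - ramp_start_c)
    return base_interval_s - progress * (base_interval_s - min_interval_s)


if __name__ == "__main__":
    # Assumed values: 60 C maximum, ramp beginning at 42 C.
    for reading in (30.0, 45.0, 51.0, 58.0, 60.0):
        print(reading, polling_interval_s(reading, max_temp_c=60.0, ramp_start_c=42.0))
```

A shorter interval corresponds to a higher polling frequency, so the returned value decreases monotonically as the measured temperature rises toward the maximum.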
The above, as well as additional purposes, features, and advantages of the present invention will become apparent in the following detailed written description.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further purposes and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, where:
a illustrates an exemplary computer system in which the present invention may be implemented;
b-c depict additional detail of a Hard Disk Drive (HDD) used by the computer system shown in
d provides additional detail of components of a Thermal Based Predictive Failure Analysis Program (TBPFAP) that adjusts Predictive Failure Analysis (PFA) polling rates according to blade temperature; and
With reference now to the figures, and in particular to
Host system 102 is able to communicate with a server 150 via a network 128 using a network interface 130, which is coupled to system bus 106. Network 128 may be an external network such as the Internet, or an internal network such as an Ethernet or a Virtual Private Network (VPN). Server 150 may have an architecture similar to that described for host system 102.
A hard drive interface 132 is also coupled to system bus 106. Hard drive interface 132 interfaces with a Hard Disk Drive (HDD) 134. In a preferred embodiment, HDD 134 populates a system memory 136, which is also coupled to system bus 106. Data that populates system memory 136 includes host system 102's operating system (OS) 138 and application programs 144.
OS 138 includes a shell 140, for providing transparent user access to resources such as application programs 144. Generally, shell 140 is a program that provides an interpreter and an interface between the user and the operating system. More specifically, shell 140 executes commands that are entered into a command line user interface or from a file. Thus, shell 140 (as it is called in UNIX®), also called a command processor in Windows®, is generally the highest level of the operating system software hierarchy and serves as a command interpreter. The shell provides a system prompt, interprets commands entered by keyboard, mouse, or other user input media, and sends the interpreted command(s) to the appropriate lower levels of the operating system (e.g., a kernel 142) for processing. Note that while shell 140 is a text-based, line-oriented user interface, the present invention will equally well support other user interface modes, such as graphical, voice, gestural, etc.
As depicted, OS 138 also includes kernel 142, which includes lower levels of functionality for OS 138, including providing essential services required by other parts of OS 138 and application programs 144, including memory management, process and task management, disk management, and mouse and keyboard management.
Application programs 144 include a browser 146. Browser 146 includes program modules and instructions enabling a World Wide Web (WWW) client (i.e., host system 102) to send and receive network messages to the Internet using HyperText Transfer Protocol (HTTP) messaging, thus enabling communication with server 150.
Application programs 144 in host system 102's system memory also include a Thermal Based Predictive Failure Analysis Controller (TBPFAC) 148. TBPFAC 148 includes code for implementing the processes described below in
The hardware elements depicted in host system 102 are not intended to be exhaustive, but rather are representative to highlight essential components required by the present invention. For instance, host system 102 may include alternate memory storage devices such as magnetic cassettes, Digital Versatile Disks (DVDs), Bernoulli cartridges, and the like. These and other variations are intended to be within the spirit and scope of the present invention.
Referring now to
With reference now to
Referring again to
As depicted in
Referring now to
Referring now to
Note that by increasing the frequency of PFA operations as HDD temperature nears a dangerous level, there is a greater chance that fatal errors can be detected and corrected. Note also that the temperature that is deemed as “approaching” or “nearing” the maximum safe temperature may be set either as a single fixed point (e.g., 90% of the “fatal” temperature), or “nearing” may be broken out into multiple breakpoints. Thus, if the temperature reaches a first breakpoint temperature (e.g., 80% of the “fatal” temperature), then a first frequency of PFA operations may be implemented. When the temperature of the electronic device (e.g., HDD) reaches a second breakpoint temperature (e.g., 90% of the “fatal” temperature), then a higher second frequency of PFA operations may be implemented. In one embodiment, the number of PFA operations increases in a non-linear manner as the temperature of the electronic device approaches the “fatal” temperature (the maximum temperature at which the electronic device can operate without damage).
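The breakpoint scheme described above can be sketched as a simple lookup table. In the following Python example, the 80% and 90% breakpoints follow the examples in the text, while the function name, the table name, and the specific polling frequencies (1, 4, and 16 polls per hour) are assumptions chosen only to show a non-linear increase.

```python
from bisect import bisect_right

# (fraction of the "fatal" temperature, PFA polls per hour once reached).
# The fractions follow the examples in the text; the frequencies are assumed.
BREAKPOINTS = ((0.80, 4.0), (0.90, 16.0))
BASELINE_POLLS_PER_HOUR = 1.0  # assumed rate below the first breakpoint


def polls_per_hour(
    temp: float,
    fatal_temp: float,
    breakpoints=BREAKPOINTS,
    baseline: float = BASELINE_POLLS_PER_HOUR,
) -> float:
    """Return the PFA polling frequency for a measured temperature.

    temp and fatal_temp must be expressed in the same unit. A breakpoint
    applies as soon as the temperature reaches it (inclusive).
    """
    if fatal_temp <= 0:
        raise ValueError("fatal_temp must be positive")
    fractions = [fraction for fraction, _ in breakpoints]
    if fractions != sorted(fractions):
        raise ValueError("breakpoints must be sorted by fraction")

    # Count how many breakpoints the current temperature has reached.
    reached = bisect_right(fractions, temp / fatal_temp)
    if reached == 0:
        return baseline
    return breakpoints[reached - 1][1]


if __name__ == "__main__":
    for reading in (40.0, 50.0, 57.0):
        print(reading, polls_per_hour(reading, fatal_temp=60.0))
```

Adding further rows to the table yields more breakpoints without changing the lookup logic, and choosing frequencies that grow geometrically, as here, is one way to obtain the non-linear increase mentioned in the text.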
It should be understood that at least some aspects of the present invention may alternatively be implemented in a computer-useable medium that contains a program product. Programs defining functions on the present invention can be delivered to a data storage system or a computer system via a variety of signal-bearing media, which include, without limitation, non-writable storage media (e.g., CD-ROM), writable storage media (e.g., hard disk drive, read/write CD ROM, optical media), and communication media, such as computer and telephone networks including Ethernet, the Internet, wireless networks, and like network systems. It should be understood, therefore, that such signal-bearing media when carrying or encoding computer readable instructions that direct method functions in the present invention, represent alternative embodiments of the present invention. Further, it is understood that the present invention may be implemented by a system having means in the form of hardware, software, or a combination of software and hardware as described herein or their equivalent.
While the present invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. Furthermore, as used in the specification and the appended claims, the term “computer” or “system” or “computer system” or “computing device” includes any data processing system including, but not limited to, personal computers, servers, workstations, network computers, mainframe computers, routers, switches, Personal Digital Assistants (PDAs), telephones, and any other system capable of processing, transmitting, receiving, capturing and/or storing data.