1. Field of the Invention
This invention is related in general to the field of data storage systems. In particular, the invention consists of a system for dynamically scaling error thresholds in a data communication fabric.
2. Description of the Prior Art
In
The communication system 18 may be a communication bus, a point-to-point network, or other communication scheme.
Some errors result from faulty cables, power transients, or defective components. Some of these types of errors can be tolerated and accommodated by the communication fabric 20 as spurious events. However, a large number of non-critical errors may indicate impending component failure or that a component is in an unstable state requiring re-initialization. Counters may be used to track these non-critical errors. When a counter exceeds a pre-determined threshold, corrective action may be taken by resetting a device, quiescing a device so that it may be repaired, or fencing a device so that it may be taken offline for replacement.
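The counter-and-threshold mechanism described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular fabric: the class name `ErrorTracker`, its methods, and the device identifiers are hypothetical, and the choice of corrective action (reset, quiesce, or fence) is left to the caller.

```python
from collections import Counter


class ErrorTracker:
    """Counts non-critical errors per device against a fixed threshold.

    Hypothetical helper for illustration; names are not from the source.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def record_error(self, device_id):
        """Record one error; return True once the count exceeds the threshold."""
        self.counts[device_id] += 1
        return self.counts[device_id] > self.threshold

    def clear(self, device_id):
        """Reset the count, e.g. after the device has been reset or replaced."""
        self.counts[device_id] = 0
```

A caller would take corrective action only when `record_error` returns `True`, so isolated spurious events below the threshold are tolerated.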
Typically, a system is configured with a default set of thresholds for error recovery, regardless of how many resources of each type it contains. However, a one-size-fits-all approach often uses system resources inefficiently, because error recovery may be triggered too early or too late.
In U.S. Pat. No. 5,331,476, Fry et al. disclose a data storage apparatus incorporating an error recovery system that is dynamically controlled to perform knowledge-based error recovery. However, the Fry invention does not take into account the number of available resources when dynamically performing error recovery. This may result in all resources engaging in error recovery while leaving no resources available for the performance of data transfer. Accordingly, it is desirable to have a system for scaling error thresholds in relation to the number of corresponding system resources.
The invention disclosed herein increases or decreases the error threshold values of all like system resource devices based on the total number of these devices. When only a few devices are available, taking even a single device off-line can severely limit the bandwidth of the communication system. As such, a device should only be taken off-line when the error condition is serious or occurs with a high degree of frequency. Conversely, when a large number of devices are available, taking one or more devices off-line may have a negligible impact on system throughput. Accordingly, threshold values are set in inverse proportion to the number of available devices: when the number of devices is relatively large, the error threshold values are set low, and when the number of devices is relatively small, the error threshold values are set high.
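One way to picture the inverse relationship is the short sketch below. The specification does not give a formula, so the integer division, the `base_threshold` constant, and the floor of 1 are all assumptions made for illustration; the function name `scaled_threshold` is likewise hypothetical.

```python
def scaled_threshold(base_threshold, device_count, minimum=1):
    """Return an error threshold inversely proportional to device_count.

    base_threshold is an assumed tuning constant: the threshold that would
    apply if only one device were available. The result never drops below
    `minimum`, so a device is never fenced on zero errors.
    """
    if device_count < 1:
        raise ValueError("device_count must be at least 1")
    return max(minimum, base_threshold // device_count)
```

Under these assumptions, a base of 64 yields a threshold of 32 with two devices and 8 with eight devices, so a scarce device is kept on-line longer than one of many.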
Various other purposes and advantages of the invention will become clear from its description in the specification that follows and from the novel features particularly pointed out in the appended claims. Therefore, to the accomplishment of the objectives described above, this invention comprises the features hereinafter illustrated in the drawings, fully described in the detailed description of the preferred embodiments and particularly pointed out in the claims. However, such drawings and description disclose just a few of the various ways in which the invention may be practiced.
This invention is based on the idea of using a dynamically scaled error threshold to regulate error recovery actions within a communication fabric of a computer storage system. The invention disclosed herein may be implemented as a method, apparatus or article of manufacture using standard programming or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture” as used herein refers to code or logic implemented in hardware or computer readable media such as optical storage devices, and volatile or non-volatile memory devices. Such hardware may include, but is not limited to, field programmable gate arrays (“FPGAs”), application-specific integrated circuits (“ASICs”), complex programmable logic devices (“CPLDs”), programmable logic arrays (“PLAs”), microprocessors, or other similar processing devices.
Referring to figures, wherein like parts are designated with the same reference numerals and symbols,
Error thresholds 127 are written by the software subcomponent 122a to each of the memory locations 125. The fabric controller 124 connects the processing device 122 to the host adapter 126 and the host adapter connects the communication fabric 120 to a host server (“host”). The processing device 122 may be a data processing server or a symmetric multi-processor (“SMP”) complex. The invention regulates error recovery actions to remedy these error conditions based on dynamically scaled error thresholds.
In this embodiment of the invention, five disparate error conditions may exist: (1) component timeout, (2) adapter warmstart timeout, (3) fabric interrupt, (4) adapter failure, and (5) adapter interrupt. A component timeout indicates that a fabric component has failed to provide an acknowledgement. An adapter interrupt indicates that the adapter has detected a failure but has not failed internally. A fabric interrupt indicates that a bus protocol violation has occurred.
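The five error conditions listed above could be represented in software roughly as shown here. The enumeration names mirror the list in the text, but keeping one counter per condition is an assumption; the specification does not say how counters map to conditions, and `new_error_counters` is a hypothetical helper.

```python
from enum import Enum


class ErrorCondition(Enum):
    """The five error conditions named in this embodiment."""
    COMPONENT_TIMEOUT = 1
    ADAPTER_WARMSTART_TIMEOUT = 2
    FABRIC_INTERRUPT = 3
    ADAPTER_FAILURE = 4
    ADAPTER_INTERRUPT = 5


def new_error_counters():
    """Return a zeroed counter for each condition (assumed layout)."""
    return {condition: 0 for condition in ErrorCondition}
```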
A dynamic threshold scaling algorithm 200 is illustrated by the flow chart of
In step 206, the error threshold is dynamically adjusted in inverse proportion to the number of available resources. If the number of resources has increased because a host adapter 126 was activated, the error threshold is reduced. If the number of resources has decreased because a host adapter 126 was deactivated, the error threshold is increased.
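The adjustment in step 206 might look like the following sketch, in which the threshold is recomputed each time a host adapter is activated or deactivated. The class `ThresholdScaler`, its method names, and the division-based scaling rule are assumptions for illustration; the specification states only that the threshold moves in inverse proportion to the number of available resources.

```python
class ThresholdScaler:
    """Recomputes an error threshold as host adapters come and go.

    Hypothetical sketch: base_threshold is an assumed constant equal to the
    threshold that applies when exactly one adapter is active.
    """

    def __init__(self, base_threshold, active_adapters):
        if active_adapters < 1:
            raise ValueError("at least one adapter must be active")
        self.base_threshold = base_threshold
        self.active_adapters = active_adapters
        self.threshold = self._rescale()

    def _rescale(self):
        # Inverse proportion, floored at 1 so the threshold stays positive.
        return max(1, self.base_threshold // self.active_adapters)

    def adapter_activated(self):
        """More resources available: the threshold is reduced."""
        self.active_adapters += 1
        self.threshold = self._rescale()
        return self.threshold

    def adapter_deactivated(self):
        """Fewer resources available: the threshold is increased."""
        if self.active_adapters <= 1:
            raise ValueError("cannot deactivate the last active adapter")
        self.active_adapters -= 1
        self.threshold = self._rescale()
        return self.threshold
```

In a real system the new value would then be written to wherever the thresholds are stored; that step is omitted here.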
Those skilled in the art of making error recovery systems may develop other embodiments of the present invention. However, the terms and expressions which have been employed in the foregoing specification are used therein as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding equivalents of the features shown and described or portions thereof, it being recognized that the scope of the invention is defined and limited only by the claims which follow.