1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular to a computer implemented method, data processing system, and computer program product for identifying a subset of sensors to sample using timing-based criteria and using the identified time critical sensors to reduce the frequency of sensor access.
2. Description of the Related Art
As computing devices become more complex and run at greater speeds, sensors are increasingly being used to monitor the conditions of the computer components so that catastrophic failure conditions can be avoided. Modern processors contain sensors which provide information about local conditions on the processor or in the system. This local information may be used to monitor the processor or system, and possibly inform actions which will change the state of the processor or system. These local condition sensors may provide temperatures, path timings, or even voltages at particular locations in the processor and its associated circuitry. For example, temperature sensors may be used in association with processors and motherboards in order to determine when temperature levels of the processors exceed a threshold. Once the threshold is exceeded, various cooling devices or techniques may be employed to reduce the temperature to a safe level.
However, monitoring the sensors in the system at all times may be expensive in terms of processing time and may exceed system bandwidth. For instance, some sensors may provide information that is irrelevant at the current operating point of the processor, but are sampled regardless. Sampling of sensors which provide no useful information increases the amount of bandwidth, storage, and processing capability required to support sensor-guided decision making.
The illustrative embodiments provide a computer implemented method, data processing system, and computer program product for identifying a subset of sensors to sample using timing-based criteria and using the identified time critical sensors to reduce the frequency of sensor access. The system determines rise times and records values for the sensors in the system. A time criticality of the sensors is determined based on the rise times. The system processes the sensors by first creating sensor subsets based on one or more constraints on the sensors. The system monitors the values of the sensors in a sensor subset and flags a sensor when it determines that, prior to a next scheduled sampling of the sensor subset, the value of that sensor will exceed a threshold constraint. The system moves each flagged sensor to a second sensor subset which complies with the sensor's constraints.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
With reference now to the figures and in particular with reference to
Computer 100 may be any suitable computer, such as an IBM® eServer™ computer or IntelliStation® computer, which are products of International Business Machines Corporation, located in Armonk, N.Y. Although the depicted representation shows a personal computer, other embodiments may be implemented in other types of data processing systems. For example, other embodiments may be implemented in a network computer. Computer 100 also preferably includes a graphical user interface (GUI) that may be implemented by means of systems software residing in computer readable media in operation within computer 100.
Next,
In the depicted example, data processing system 200 employs a hub architecture including a north bridge and memory controller hub (NB/MCH) 202 and a south bridge and input/output (I/O) controller hub (SB/ICH) 204. Processing unit 206, main memory 208, and graphics processor 210 are coupled to north bridge and memory controller hub 202. Processing unit 206 may contain one or more processors and even may be implemented using one or more heterogeneous processor systems. Graphics processor 210 may be coupled to the NB/MCH through an accelerated graphics port (AGP), for example.
In the depicted example, local area network (LAN) adapter 212 is coupled to south bridge and I/O controller hub 204. Audio adapter 216, keyboard and mouse adapter 220, modem 222, read only memory (ROM) 224, universal serial bus (USB) and other ports 232, and PCI/PCIe devices 234 are coupled to south bridge and I/O controller hub 204 through bus 238. Hard disk drive (HDD) 226 and CD-ROM 230 are coupled to south bridge and I/O controller hub 204 through bus 240.
PCI/PCIe devices may include, for example, Ethernet adapters, add-in cards, and PC cards for notebook computers. PCI uses a card bus controller, while PCIe does not. ROM 224 may be, for example, a flash binary input/output system (BIOS). Hard disk drive 226 and CD-ROM 230 may use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. A super I/O (SIO) device 236 may be coupled to south bridge and I/O controller hub 204.
An operating system or hypervisor runs on processing unit 206. This operating system coordinates and controls various components within data processing system 200 in
Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 226. These instructions may be loaded into main memory 208 for execution by processing unit 206. The processes of the illustrative embodiments may be performed by processing unit 206 using computer implemented instructions, which may be located in a memory. Examples of such a memory include main memory 208 and read only memory 224; the memory may also be located in one or more peripheral devices.
The hardware shown in
The systems and components shown in
Other components shown in
The depicted examples in
The illustrative embodiments provide a mechanism for identifying a subset of sensors to sample using timing-based criteria, and using the identified time critical sensors to reduce the frequency of sensor access. The illustrative embodiments provide an advantage over existing sensor sub-setting implementations, since existing sensor sub-setting implementations do not take into account the location and time criticality of any given sensor. Instead, existing sensor sub-setting implementations rely on random selection or signal-to-noise ratios to determine which sensors to sample.
With the illustrative embodiments, time-based criteria are used by a system monitor to identify which sensor subsets should be read and at what time frequency. While the illustrative embodiments allow for reading or sampling all sensors at each time interval, doing so uses extra bandwidth and processing resources which could be given to another source. Thus, the mechanism in the illustrative embodiments first segregates the sensors into subsets based on the time criticality criteria. Time criticality is a constraint on a sensor and reflects how frequently a sensor should be read to ensure safe operation of the system. The characterization of sensors into appropriate time-critical subsets allows a system monitor, such as in the form of a hypervisor, operating system, or out-of-band controller, to reduce the amount of sensor data that needs to be gathered and processed at any given time.
Once the system monitor determines the time criticality of each sensor, the system monitor may then use the subset information to determine the necessary read frequencies of the sensor subsets. The time critical sensor subsets may then be sampled less often according to the read frequencies to reduce the frequency of sensor access and processing. In addition to the time criticality constraint on the sensors, other constraints may also apply when determining the necessary read frequencies. The sensors may already be sub-setted in some manner, such as by the ability to read multiple sensors simultaneously, or by the time it takes to perform a read of a given sensor. For instance, these constraints may include hard-wired sensor groups which must be read simultaneously, the location of sensors on busses, bus speeds, or environmental conditions that lead to a change in time criticality. These constraints are taken into account when the system monitor defines the frequency of reads, since the constraints affect the time criticality of sensor reads. Only the impact of simultaneous reads is considered in the simple examples presented in
In one particular embodiment, a static, worst case scenario may be used to determine the time criticality of the sensors. For instance, in a worst case scenario, the ratios, and thus subsets, may be determined by the worst case possible rise time and the desired frequency of reading some known sensor. The rise time is the time it takes a sensor to go from one known temperature to another known temperature. For example, rise time may be measured from 10% above the starting temperature value to 90% of the ending temperature value, and typically captures the region of most rapid change in value. In addition, the initial procedure for identifying the time criticality of sensors may be performed during manufacturing-related tests, made part of the initial boot of a system, or performed at any time during operation by an agent such as a hypervisor or operating system. The initial identification relies on a set of test vectors which mimics the worst case workload or another workload of interest for the sensor type of interest which will be seen on the system. The timing information of the worst case workload or another workload of interest is then recorded.
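The rise time measurement described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes the recorded trace rises from its first sample to its last sample, and it interprets the 10% and 90% points as fractions of the total temperature swing. The function name `rise_time` and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def rise_time(times: Sequence[float], temps: Sequence[float],
              low_frac: float = 0.10, high_frac: float = 0.90) -> Optional[float]:
    """Return the time taken to go from low_frac to high_frac of the swing.

    Assumption: temps[0] is the starting value and temps[-1] the ending value
    of a rising trace recorded under the workload of interest.
    """
    start, end = temps[0], temps[-1]
    swing = end - start
    if swing <= 0:
        return None  # the trace did not rise, so no rise time is defined

    low = start + low_frac * swing
    high = start + high_frac * swing

    # First sample time at which each level is reached.
    t_low = next((t for t, v in zip(times, temps) if v >= low), None)
    t_high = next((t for t, v in zip(times, temps) if v >= high), None)
    if t_low is None or t_high is None:
        return None
    return t_high - t_low
```

The result is only as fine-grained as the sampling frequency of the recorded trace, so a calibration run would need to sample faster than the shortest rise time it is expected to resolve.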
For example, a system may run a set of vectors mimicking a worst case scientific workload. The temperatures, as reported by digital thermal sensors in the system, are recorded at a known frequency. In the temperature range of interest, the rise times for each digital thermal sensor are recorded. The digital thermal sensors may then be sorted into categories based on rise time. Rise times, taken into consideration when a sensor is already near the thermal limits of the system, give an indication of how often a specific digital thermal sensor must be read to ensure safe operation of the system. Further sub-setting may be performed by characterizing the fall times of each digital thermal sensor as well. The rise times of the various sensors under a worst case workload may be used to determine the ratio between reads of the various sensor subsets. If the sensors in the most critical subset need to be read every other cycle, then the remaining subsets may be read at some multiple of this base frequency.
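One way to turn recorded rise times into subsets read at multiples of a base frequency is sketched below. The grouping rule is an assumption made for illustration: each sensor is given a read period equal to the base period multiplied by the whole number of times the fastest rise time fits into its own rise time. The names `build_subsets`, `rise_times`, and `base_period` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_subsets(rise_times: Dict[str, float],
                  base_period: int = 1) -> Dict[int, List[str]]:
    """Group sensors by read period, keyed by period in read cycles.

    Assumption: base_period is the period chosen for the most critical
    (fastest rising) sensor and is no longer than that sensor's rise time.
    """
    fastest = min(rise_times.values())
    if fastest <= 0:
        raise ValueError("rise times must be positive")

    subsets: Dict[int, List[str]] = defaultdict(list)
    for name, rt in sorted(rise_times.items()):
        # Whole multiples only, so every subset lines up with the base cycle.
        multiple = max(1, int(rt // fastest))
        subsets[base_period * multiple].append(name)
    return dict(subsets)
```

With `base_period=2`, which corresponds to reading the most critical subset every other cycle, a sensor whose rise time is three times the fastest one lands in the subset read every sixth cycle.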
In another example, the ratio between reads of the various sensor subsets may be determined in a reactive manner based on how hot each sensor subset is running. For instance, the digital thermal sensor subset that is running the hottest may be read at one frequency, while cooler subsets are read at a reduced frequency. The ratio may be determined by the delta between the coldest hot sensor and the hottest hot sensor, and how quickly a sensor can move from one sensor category (or subset) to another.
The membership of the subsets may change dynamically in response to changes in the values measured by the sensors. Similarly, the ratios between the subsets may change dynamically as the temperatures of their members and their member lists change. These changes reflect changes in the time criticality of the sensors.
In another example, the ratio between reads of the various sensor subsets may be determined in a predictive manner. In this example, the system monitor may determine which sensors should be read by a prediction of the future dangers faced by the system. If a number of floating point intensive tasks are allocated, then a digital thermal sensor near the floating point unit may move from a less frequently accessed subset to a more frequently accessed subset. The ratios between subsets may be determined by taking into consideration both the current sensor readings and the predicted future need. The confidence of the prediction may reduce the ratios between the subsets to encapsulate a more conservative approach or to take into account the error in the prediction.
The particular sensors used to monitor the condition of the computer system in the illustrative examples are digital thermal sensors (DTS). However, it should be noted that the mechanisms described herein may be applied to other sensor types, such as critical path monitors, out-of-band performance counters, voltage excursion monitors, and the like, as well as to sensors in other environments, such as various weather sensors.
Processor 302 may be any type of processor provided in a computing device. Processor 302 may be, for example, the central processing unit (CPU) of the computing device, a secondary processor, or a processor dedicated to controlling the functions of environmental control devices, such as processor fans and the like. In one particular embodiment, processor 302 may be a service processor (SP).
Processor 302 is coupled to environmental sensor controller 304 such that data may be transmitted between processor 302 and environmental sensor controller 304. Such coupling may include a system bus or the like.
Environmental sensor controller 304 may be any type of controller that sends and receives information to and from various sensors 306-310, processes the information, and provides processor 302 with reports of the conditions monitored by sensors 306-310. In one particular embodiment, environmental sensor controller 304 may be a System Power Control Network (SPCN) which may monitor thermal, RPM, and other sensors in the computing device.
Environmental sensor controller 304 is coupled to one or more sensors 306-310 via appropriate connections such that data may be transmitted to and from the sensors 306-310. Sensors 306-310 may be any type of sensor used to monitor environmental conditions of the computing device. For example, sensors 306-310 may include temperature sensors, RPM sensors, heat flow sensors, and the like.
At initialization, processor 302 obtains initial measurement values from each of sensors 306-310 via environmental sensor controller 304. Processor 302 then calculates a difference between the initial measurement values and a next warning/critical level. The warning/critical levels may be determined from information stored in a system memory, ROM, or the like (not shown).
Based on the difference between the initial measurement values and the next warning/critical level, processor 302 may send a message to environmental sensor controller 304 instructing environmental sensor controller 304 not to gather any reports until the predicted value of the sensor(s) meets or exceeds a reporting threshold calculated as a function of the difference between the initial measurement values and the next warning/critical level. This function may be any type of function suitable for the particular application in which the illustrative embodiments are implemented. These differences and/or report thresholds may be stored by environmental sensor controller 304 in a memory (not shown) in association with an identifier of sensors 306-310 to which the report threshold corresponds.
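Because the text leaves the threshold function open, the sketch below picks one simple possibility purely for illustration: the report threshold sits a fixed fraction of the way from the initial measurement to the next warning/critical level. The names `report_threshold`, `should_report`, and the `fraction` parameter are hypothetical and are not taken from the embodiments.

```python
def report_threshold(initial: float, next_level: float,
                     fraction: float = 0.5) -> float:
    """Return a report threshold between the initial value and the next level.

    Assumption: a linear function of the difference; any other suitable
    function could be substituted for a particular application.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return initial + fraction * (next_level - initial)


def should_report(measured: float, threshold: float) -> bool:
    """True when the measured value meets or exceeds the report threshold."""
    return measured >= threshold
```

A smaller `fraction` makes the controller report earlier and more often, while a value near 1 defers reports until the sensor is close to the warning/critical level.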
When a measured value from a sensor 306-310 is reported to environmental sensor controller 304 which exceeds the report threshold for the sensor 306-310, environmental sensor controller 304 sends a report to processor 302. Based on the reported sensor measurement value, processor 302 may send appropriate control signals to control systems to take corrective action to prevent a critical condition from occurring within the computing device. Thus, for example, once the temperature of a processor within the computing device is within a critical level, processor 302 may issue instructions to a system fan to start, and thereby attempt to reduce the temperature of the processor. Similarly, if the temperature were to keep increasing, processor 302 may take additional corrective action, and may even cause the processor to shut down rather than risk permanent damage to the processor.
When a sensor operation comprising fixed thresholds and variable frequency-based sub-setting is implemented, proper subsets may be derived using a single unique piece of information for each sensor, such as the rate of temperature increase in the region of interest. Graph 500 illustrates that two sensor limits, sensor limit 1 510 and sensor limit 2 512, are defined for the system. In each of the following examples, the system reaches an unsafe operating point when the temperature rises above sensor limit 2 512. To guarantee continued safe operation of the system, as defined by no sensor exceeding sensor limit 2 512, sensors S2 502, S3 504, S6 506, and S7 508 should be monitored. It should be noted that a different workload applied to the system may produce different temperature values for each sensor.
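Given a fixed limit and a per-sensor rate of temperature increase, the number of intervals a sensor can safely go unread may be estimated as in the sketch below. It assumes a constant worst-case rise per interval, which is a simplification, and the names `intervals_until_limit` and `max_reachable` are hypothetical.

```python
import math
from typing import Optional


def intervals_until_limit(current: float, limit: float,
                          rate_per_interval: float,
                          max_reachable: Optional[float] = None) -> Optional[int]:
    """Return how many whole intervals can pass before the limit is exceeded.

    Returns None when the sensor can never exceed the limit, and 0 when it
    must be read again at the next opportunity.
    Assumption: the temperature rises by at most rate_per_interval each interval.
    """
    if max_reachable is not None and max_reachable <= limit:
        return None  # the sensor cannot exceed the limit under this workload
    if rate_per_interval <= 0:
        return None  # not rising, so no deadline is implied
    if current >= limit:
        return 0
    return math.floor((limit - current) / rate_per_interval)
```

Even when the function returns None, a system monitor would likely still read the sensor occasionally as a safety margin, since a different workload may change the rate.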
As shown in graph 500 by a fixed threshold (e.g., sensor limit 2 512) and variable frequency-based sub-setting implementation, a sample is taken from all four sensors at time 47. The frequency of monitoring is variable. For example, sensor 3 504 needs to be read again at time 48, which is the point at which the sensor temperature reaches sensor limit 2 512. Similarly, sensor 7 508 should be read again before 6 time intervals, sensor 2 502 should be read again before 19 time intervals, and sensor 6 506 never needs to be read because it cannot exceed sensor limit 2 512. However, in practice, sensor 6 506 would be read occasionally as an added safety measure to ensure that a sensor does not overheat under any workload or operating condition. The remaining four sensors S1 422, S4 428, S5 430, and S8 436 in
Based on the content of graph 500, the following reading ratios may be proposed for the sensors:
Thus, the variable frequency-based sub-setting implementation in the illustrative embodiments improves upon the original bandwidth requirements of existing sensor operations, which have a maximum value of 8× the size of a sensor read in bytes per second and an average value of 8× the size of a sensor read in bytes per second. The new requirements using the frequency-based sub-setting implementation in the illustrative embodiments allow for a maximum value of 4× the size of a sensor read in bytes per second and an average of 2.15× the size of a sensor read in bytes per second.
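The peak and average figures quoted above are expressed as multiples of the size of one sensor read. The sketch below shows how such multiples could be computed from a set of subsets and read periods. It assumes all subsets start aligned, so that every subset is read together in the worst interval, and the subset sizes in the usage note are hypothetical; they are not the ratios proposed for graph 500 and do not reproduce the 2.15× average.

```python
from typing import Dict, Tuple


def read_load(subset_sizes: Dict[int, int]) -> Tuple[int, float]:
    """Return (peak, average) sensor reads per interval.

    subset_sizes maps a read period, in intervals, to the number of sensors
    read with that period. Results are in units of one sensor read.
    """
    # Worst case: every subset falls due in the same interval.
    peak = sum(subset_sizes.values())
    # Long-run average: each subset contributes count / period reads per interval.
    average = sum(count / period for period, count in subset_sizes.items())
    return peak, average
```

Reading all eight sensors every interval, `read_load({1: 8})`, gives a peak and an average of 8, which matches the existing-operation figures. A hypothetical split of two sensors read every interval and two read every fourth interval, `read_load({1: 2, 4: 2})`, gives a peak of 4 and an average of 2.5.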
The temperature at a given time in
Based on sensor limit 2 612, the system monitor determines that sensor 2 602 should be read again in 19 intervals, sensor 3 604 should be read again in 2 intervals, sensor 7 608 should be read again in 5 intervals, and sensor 6 606 never needs to be read. Thus, while the non-worst case workload in
Consequently, at interval 50, the subsets and reading requirements are:
The process begins with the system determining the read rise time functions for all sensors in the critical region (step 702). Rise time may be obtained using any suitable form of calibration which detects the time it takes a sensor to go from one value (e.g., temperature) to another value. The rise times provide an indication of how often a specific sensor must be read to ensure safe operation of the system when a sensor is already near an operating threshold. The system then records the values of the sensors in the critical region (step 704). The system then determines the initial time criticality and the other constraints of all sensors in the critical region (step 706). The time criticality of a sensor is a constraint which depends upon either the rise time function or the voltage margin function of the sensor, or a combination of both. The voltage margin function determines how close the sensor is to the minimum safe voltage, and allows for reading the sensor more frequently during certain events. The other constraints on the sensors may include, but are not limited to, hard-wired sensor groups which must be read simultaneously, the location of sensors on busses, bus speeds, or environmental conditions that lead to a change in time criticality. Based on the initial time criticality and other constraints of each sensor, the system constructs the initial sensor subsets, including the membership and read frequency requirements of the sensors (step 708).
The system then monitors all of the sensor subsets at the appropriate read times (step 710). For example, the system may read/process sensor values in subset A at every read time, and read/process values in subset B at every third read time, etc. When reading a current subset, a determination is made as to whether the value of a sensor in the current subset plus the sensor's rise time function may exceed a threshold of safe operation (i.e., thermal limit) prior to the next scheduled read of the sensor subset (step 712). If the value of the sensor in the currently read subset plus the sensor's rise time function may exceed the threshold before the next scheduled read of the sensor subset (‘yes’ output of step 712), the system flags the sensor (step 714). The system then determines if there are remaining sensors in the current subset to process (step 716).
Turning back to step 712, if the value of the sensor in the currently read subset plus the sensor's rise time function is not expected to exceed the threshold before the next scheduled read of the sensor subset (‘no’ output of step 712), the process continues to step 716 to determine whether a next sensor in the subset remains to be processed.
If there are additional sensors in the subset (‘yes’ output of step 716), the process loops back to step 712 to process the next sensor in the subset. If there are no additional sensors in the subset (‘no’ output of step 716), the system then determines the time criticality of the flagged sensors, and moves the sensors to another sensor subset which complies with threshold constraints (step 718). The process then loops back to step 710 to process the next sensor subset.
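Steps 712 through 718 can be sketched as a single pass over the subsets. This is a simplified illustration under stated assumptions, not the claimed process: the rise time function is reduced to a constant worst-case rise per interval, subsets are keyed by their read period, and a flagged sensor is moved to the longest period that still keeps it at or below the threshold. The names `Sensor`, `rebalance`, and `rise_per_interval` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sensor:
    name: str
    value: float              # most recent reading
    rise_per_interval: float  # assumed constant worst-case rise per interval


def rebalance(subsets: Dict[int, List[Sensor]],
              limit: float) -> Dict[int, List[Sensor]]:
    """Flag sensors that may exceed limit before their next read and move them.

    subsets maps a read period (in intervals) to the sensors read at that period.
    """
    periods = sorted(subsets)
    result: Dict[int, List[Sensor]] = {p: [] for p in periods}

    for period in periods:
        for s in subsets[period]:
            # Step 712: may the value exceed the limit before the next read?
            if s.value + s.rise_per_interval * period <= limit:
                result[period].append(s)
                continue
            # Steps 714 and 718: flagged, so find a subset that complies.
            safe = [p for p in periods
                    if s.value + s.rise_per_interval * p <= limit]
            # Fall back to the most frequently read subset if none complies.
            result[max(safe) if safe else periods[0]].append(s)
    return result
```

This sketch only promotes sensors to more frequently read subsets. Moving a cooled sensor back to a less frequently read subset would need a separate rule, for example one based on fall times.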
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any tangible apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Relation | Number | Date | Country
---|---|---|---
Parent | 11754468 | May 2007 | US
Child | 12342054 | | US