This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2014-243548 filed on Dec. 1, 2014, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a system monitoring technology.
A service processor is installed in a large scale server in order to monitor and control components provided in the large scale server.
Related technologies are disclosed in, for example, Japanese Laid-Open Patent Publication No. S60-074100 (Japanese Examined Patent Application Publication No. H3-30915), Japanese Laid-Open Patent Publication No. H08-125622, Japanese Laid-Open Patent Publication No. 2012-230597, and Japanese Laid-Open Patent Publication No. 2014-016671.
According to one aspect of the embodiments, an information processing apparatus includes: a processor; a module; and a controller, wherein the processor is configured to transmit a first condition for detecting an abnormality of the module to the controller, and the controller is configured to: acquire a first information from the module; determine whether the first information satisfies the first condition; and transmit a second information indicating that the abnormality of the module is detected to the processor when the first information satisfies the first condition.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
A service processor is an independent processing unit which includes, for example, a central processing unit (CPU), a memory and the like. A target component to be monitored and controlled may include, for example, a CPU, a memory, an HDD (Hard Disk Drive) or an SSD (Solid State Drive), a cooling fan, and a temperature sensor. The service processor is installed so that an abnormality occurring in a component within the server is detected and reported to a server manager.
The processing load of the CPU of the service processor increases as the number of components within the server increases. When the processing load of the CPU in the service processor increases, a processing delay occurs and a countermeasure for coping with the abnormality occurring in a component within the server may be delayed. A technology that merely monitors an apparatus may not reduce the processing load of the CPU of the service processor.
The service processor 1000 includes a CPU 1001, a Read Only Memory (ROM) 1002, a Random Access Memory (RAM) 1003, and a Flash Memory (FMEM) 1004.
The CPU 1001 may load firmware stored in the ROM 1002 onto the RAM 1003 to execute the firmware so as to execute the function as illustrated in
The system board 100 as illustrated in
The MBC 110 includes an execution control unit 111, a buffer management unit 112, a Joint Test Action Group (JTAG) control circuit 113, and an Inter-Integrated Circuit (I2C) control circuit 114. JTAG and I2C are examples of protocols that may be used, and other protocols may be used as well.
The execution control unit 111 executes a command set stored in a command I/F (Interface) area 121 of the buffer 120 to control the JTAG control circuit 113 and the I2C control circuit 114. The JTAG control circuit 113 acquires data from the components 101 and 102 to output the data to the execution control unit 111. The I2C control circuit 114 acquires data from the components 103 to 105 to output the data to the execution control unit 111. The buffer management unit 112 manages the buffer 120.
The buffer 120 includes the command I/F area 121 and a result I/F area 122.
The command portion may include the designations of target components from which data are to be acquired.
The register 130 illustrated in
The processing unit 1011 of the service processor 1000 reads a value to be set to the interval register 132 from the setting data storing unit 1010. The processing unit 1011 notifies the MBC 110 of the system board 100 of the read value of the interval register 132 (Operation S1 of
The processing unit 1011 reads a command set, a threshold value, information indicating a comparison type, and a value of the VALID flag, for example, “ON,” that are relevant for each component from the setting data storing unit 1010. The processing unit 1011 notifies the MBC 110 of the system board 100 of the read command set, threshold value, information indicating the comparison type, and the value of the VALID flag (Operation S5). Accordingly, the buffer management unit 112 of the MBC 110 receives the command set, threshold value, information indicating the comparison type, and the value of VALID flag relevant for each component and stores the received ones in the command I/F area 121 (Operation S7).
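The per-component information listed above (command set, threshold value, comparison type, and VALID flag) can be pictured as one record per component. The following is a minimal sketch only; the class and field names, the placeholder command string, and the choice of component 103 as a cooling fan are assumptions made for illustration and are not taken from the embodiment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ComparisonType(Enum):
    COINCIDENCE = "coincidence"  # acquired data is compared for equality with the threshold
    RANGE = "range"              # acquired data is checked against lower and upper limits


@dataclass
class CommandListEntry:
    """Hypothetical layout of one per-component entry in the command I/F area."""
    component_id: int                # e.g. one of 101 to 105
    commands: List[str]              # command set used to acquire data
    comparison: ComparisonType
    threshold: Optional[int] = None  # used when comparison is COINCIDENCE
    lower: Optional[int] = None      # used when comparison is RANGE
    upper: Optional[int] = None      # used when comparison is RANGE
    valid: bool = True               # VALID flag: "ON" is True, "OFF" is False


# Illustrative entry; treating component 103 as a cooling fan is an assumption.
fan = CommandListEntry(
    component_id=103,
    commands=["READ_FAN_SPEED"],     # placeholder command name
    comparison=ComparisonType.RANGE,
    lower=900,
    upper=1100,
)
```

Keeping the VALID flag inside the same record as the thresholds reflects the description that both are stored in the area of the command I/F area 121 relevant for the component.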
The processing unit 1011 reads the value, for example, “ON” to be set to the execution register 133 from the setting data storing unit 1010. The processing unit 1011 notifies the MBC 110 of the system board 100 of the read value of the execution register 133 (Operation S9). Accordingly, the execution control unit 111 of the MBC 110 receives the value of the execution register 133 from the processing unit 1011 and stores the received value in the execution register 133 (Operation S11).
The execution control unit 111 of the MBC 110 executes a monitoring process (Operation S13).
The execution control unit 111 instructs the buffer management unit 112 to read the command list relevant for the components 101 to 105. The buffer management unit 112 reads the command list relevant for the components 101 to 105 from the buffer 120 to output the command list to the execution control unit 111. The execution control unit 111 sequentially executes the command set, for example, a single command or a plurality of the commands, of each component so as to control the JTAG control circuit 113 and the I2C control circuit 114, and acquire data from each component (Operation S21 of
The execution control unit 111 outputs the data acquired from the components 101 to 105 to the buffer management unit 112. The buffer management unit 112 stores the data acquired from the components 101 to 105 in the result I/F area 122 (Operation S23).
The buffer management unit 112 specifies a single unprocessed command list from the command I/F area 121 (Operation S25).
The buffer management unit 112 determines whether the value of the VALID flag included in the command list specified at Operation S25 is “ON” (Operation S27).
When it is determined that the value of the VALID flag included in the command list specified at Operation S25 is not “ON” (“NO” route at Operation S27), the value of the VALID flag is “OFF.” The monitoring process proceeds to Operation S45. When it is determined that the value of the VALID flag included in the command list specified at Operation S25 is “ON” (“YES” route at Operation S27), the buffer management unit 112 determines whether the information indicating the comparison type included in the command list specified at Operation S25 indicates a “coincidence” (Operation S31).
When it is determined that the information indicating the comparison type indicates the “coincidence” (“YES” route at Operation S31), the buffer management unit 112 determines whether the threshold value included in the command list specified at Operation S25 is coincident with the data acquired from the component associated with the command list specified at Operation S25 (Operation S33).
When it is determined that the threshold value is coincident with the data acquired from the component (“YES” route at Operation S33), the buffer management unit 112 stores the determination result indicating that the abnormality is not present in the component, for example, indicating that the component is normal, in the determination result storing area of the result I/F area 122 (Operation S35). The buffer management unit 112 increments a generation for the previously stored determination result by 1 (one), deletes the data relevant for the generation n+1, and stores the determination result in the determination result storing area as the data relevant for the generation 1. The monitoring process proceeds to Operation S45.
When it is determined that the information indicating the comparison type does not indicate “coincidence” (“NO” route at Operation S31), the comparison type is a “range.” Accordingly, the buffer management unit 112 determines whether the data acquired from the component associated with the command list specified at Operation S25 is included in a range determined by the upper limit threshold value and the lower limit threshold value included in the command list specified at Operation S25 (Operation S37).
When it is determined that the data acquired from the component is included in the range determined by the upper limit threshold value and the lower limit threshold value (“YES” route at Operation S37), the buffer management unit 112 stores the determination result indicating that the abnormality is not present in the component, for example, indicating that the component is normal, in the determination result storing area of the result I/F area 122 (Operation S39). The buffer management unit 112 increments the generation of the previously stored determination result by 1 (one), deletes the data relevant for the generation n+1, and stores the determination result in the determination result storing area as the data relevant for the generation 1. The monitoring process proceeds to Operation S45.
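The generation handling described for the determination result storing area behaves like a fixed-length history in which the newest result is generation 1. A minimal sketch, assuming n = 3 and a boolean result (True meaning that an abnormality is detected); the names `N_GENERATIONS`, `history`, and `store_result` are hypothetical.

```python
from collections import deque

N_GENERATIONS = 3  # assumed value of n

# Index 0 holds generation 1 (newest); index n-1 holds generation n (oldest).
history = deque(maxlen=N_GENERATIONS)


def store_result(history, abnormal):
    """Store a new determination result as generation 1."""
    # appendleft moves every stored result one generation older and, once the
    # deque is full, discards the entry that would become generation n+1.
    history.appendleft(abnormal)


for result in [False, True, True, True]:
    store_result(history, result)
```

After the four calls above, the oldest result has been discarded and only the three most recent results remain.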
When it is determined that the data acquired from the component is not included in the range determined by the upper limit threshold value and the lower limit threshold value (“NO” route at Operation S37) or when it is determined that the threshold value is not coincident with the data acquired from the component (“NO” route at Operation S33), the buffer management unit 112 stores the determination result indicating that the abnormality of the component is detected in the determination result storing area of the result I/F area 122 (Operation S41).
The buffer management unit 112 notifies the execution control unit 111 of the fact that the abnormality of the component is detected. Accordingly, the execution control unit 111 sets the value of the interrupt register 131 to “ON” and transmits an interrupt signal to the service processor 1000 (Operation S43).
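The determination of Operations S27 to S43 can be summarized in software form, although in the embodiment it is performed by the MBC 110, which is hardware. The sketch below is illustrative only: the dictionary layout, the function names, and the callback `raise_interrupt` are assumptions, and the range check is assumed to include both limits.

```python
def is_abnormal(comparison, value, threshold=None, lower=None, upper=None):
    """Return True when the acquired value fails its check."""
    if comparison == "coincidence":
        return value != threshold            # Operation S33
    # Any other comparison type is treated as "range", as in the text.
    return not (lower <= value <= upper)     # Operation S37 (inclusive limits assumed)


def monitor_once(entries, acquired, raise_interrupt):
    """Evaluate every command list once and report abnormalities."""
    results = {}
    for entry in entries:
        if not entry["valid"]:               # Operation S27: VALID flag is "OFF"
            continue
        cid = entry["component_id"]
        abnormal = is_abnormal(
            entry["comparison"], acquired[cid],
            entry.get("threshold"), entry.get("lower"), entry.get("upper"),
        )
        results[cid] = abnormal              # Operations S35, S39, or S41
        if abnormal:
            raise_interrupt(cid)             # Operation S43
    return results


entries = [
    {"component_id": 101, "valid": True, "comparison": "coincidence", "threshold": 0},
    {"component_id": 103, "valid": True, "comparison": "range", "lower": 900, "upper": 1100},
    {"component_id": 104, "valid": False, "comparison": "range", "lower": 0, "upper": 50},
]
acquired = {101: 0, 103: 1250, 104: 80}      # made-up sample readings
interrupts = []
results = monitor_once(entries, acquired, interrupts.append)
```

In this made-up example only component 103 triggers the callback; component 104 is out of range but is skipped because its VALID flag is "OFF".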
The buffer management unit 112 determines whether an unprocessed command list exists (Operation S45). When it is determined that the unprocessed command list exists (“YES” route at Operation S45), the buffer management unit 112 specifies one of the unprocessed command lists (Operation S29) and the monitoring process goes back to the processing performed at Operation S27. When it is determined that the unprocessed command list does not exist (“NO” route at Operation S45), the buffer management unit 112 sets the current time as the time at which the previous monitoring was executed, and stores the set time in the RAM 107. The monitoring process proceeds to Operation S47 of
As illustrated in
When it is determined that the current time is not the execution timing (“NO” route at Operation S49), the execution control unit 111 stops a processing for a certain period of time, and the monitoring process goes back to Operation S49. When it is determined that the current time is the execution timing (“YES” route at Operation S49), the execution control unit 111 determines whether the value of the execution register 133 is “ON” (Operation S51).
When it is determined that the value of the execution register 133 is “ON” (“YES” route at Operation S51), the monitoring process goes back to Operation S21 of
The service processor 1000 collectively transmits the command lists relevant for a plurality of components to the MBC 110, and the service processor 1000 is notified of the detection of the abnormality only when the abnormality is detected by the MBC 110. Therefore, the processing load of the CPU 1001 is reduced and the occurrence of the processing delay may be decreased. Even when the number of components increases, the resulting increase in the processing load of the CPU 1001 may be limited.
The MBC 110, which is hardware, is suitable for simple repetitive processing or batch processing, but is not suitable for processing that includes complex branching. Accordingly, processing suitable for the MBC 110 is executed by the MBC 110 rather than the service processor 1000. The processing may be efficiently executed and high-speed processing may be achieved in the entire information processing apparatus 1.
The processing unit 1011 of the service processor 1000 which has received the interrupt signal specifies the component, for which the abnormality is detected, from the determination result storing area (Operation S61 of
The processing unit 1011 compares the data stored in the determination result storing area with a threshold value (Operation S63), and determines whether the determination made by the MBC 110 is correct (Operation S65). When it is determined that the determination made by the MBC 110 is not correct (“NO” route at Operation S65), the processing unit 1011 stores an error log in the FMEM 1004 (Operation S67). The error log may include, for example, information indicating that the determination made by the MBC 110 is not correct. The service processor 1000 may output the error log to, for example, a display device.
The processing unit 1011 executes a restart of the MBC 110 (Operation S69). The process performed by the service processor is ended.
When it is determined that the determination made by the MBC 110 is correct (“YES” route at Operation S65), the processing unit 1011 determines whether the detection of the abnormality is continued for a certain number of times (Operation S71). When the certain number of times is, for example, 3 (three), it is determined whether each of the determination result of the generation 1, the determination result of the generation 2, and the determination result of the generation 3 indicates that the abnormality is detected.
When it is determined that the detection of the abnormality is not continued for the certain number of times (“NO” route at Operation S71), it is estimated that the abnormality does not occur and thus, the process is ended. When it is determined that the detection of the abnormality is continued for the certain number of times (“YES” route at Operation S71), the processing unit 1011 stores the error log in the FMEM 1004 (Operation S73). The error log may include, for example, identification information of the component specified at Operation S61.
The processing unit 1011 notifies the MBC 110 of the system board 100 of the value of the execution register 133, for example, “OFF” (Operation S75). Accordingly, the buffer management unit 112 of the MBC 110 receives the value of the execution register 133 from the processing unit 1011 and stores the value in the execution register 133.
The processing unit 1011 notifies the MBC 110 of the system board 100 of the value of the VALID flag, for example, “OFF” and the identification information of the specified component (Operation S77). Accordingly, the buffer management unit 112 of the MBC 110 receives the value of the VALID flag and the identification information of the specified component from the processing unit 1011 and stores the value of the VALID flag in an area of the command I/F area 121 relevant for the specified component. The process is ended. It may be possible to reduce the retransmission of an interrupt signal for the specified component.
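The handling performed by the service processor after an interrupt (Operations S63 to S77) can be sketched as follows. This is a simplified illustration, not the firmware of the embodiment: only the "range" comparison is shown, the actions are returned as strings rather than performed, and the names `handle_interrupt`, `stored_value`, `history`, and `required_count` are hypothetical. `history` is assumed to list determination results with generation 1 first.

```python
def handle_interrupt(component_id, stored_value, lower, upper, history,
                     required_count=3):
    """Return the actions the service processor would take, in order."""
    actions = []

    # Operations S63/S65: repeat the comparison to confirm the MBC's determination.
    really_abnormal = not (lower <= stored_value <= upper)
    if not really_abnormal:
        actions.append("log: MBC determination incorrect")             # S67
        actions.append("restart MBC")                                  # S69
        return actions

    # Operation S71: were the newest `required_count` generations all abnormal?
    recent = list(history)[:required_count]
    if len(recent) < required_count or not all(recent):
        return actions                                                 # treated as no abnormality

    actions.append(f"log: abnormality in component {component_id}")    # S73
    actions.append("execution register = OFF")                         # S75
    actions.append(f"VALID[{component_id}] = OFF")                     # S77
    return actions


confirmed = handle_interrupt(103, 1250, 900, 1100, [True, True, True])
transient = handle_interrupt(103, 1250, 900, 1100, [True, False, True])
mbc_error = handle_interrupt(103, 1000, 900, 1100, [True, True, True])
```

Here `confirmed` contains the three actions of Operations S73 to S77, `transient` is empty because the abnormality was not detected in three consecutive generations, and `mbc_error` contains the log and restart actions because the re-check does not reproduce the abnormality.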
By the process as described above, the service processor 1000 which has received an interrupt signal may rapidly perform the countermeasure against the abnormality. Since it is confirmed whether an error exists in the determination made by the MBC 110, cases in which a countermeasure is performed although no abnormality has actually occurred may be reduced. The data acquisition is stopped for all the components while coping with the abnormality, for example, during the maintenance of a certain component. Therefore, the acquisition of wrong data while a countermeasure against the abnormality is being performed may be reduced.
The processing unit 1011 detects that a certain event has occurred (Operation S81 of
The processing unit 1011 notifies the MBC 110 of the system board 100 of the value of the execution register 133, for example, “OFF” (Operation S83). Accordingly, the buffer management unit 112 of the MBC 110 receives the value of the execution register 133 from the processing unit 1011 and stores the value in the execution register 133.
The processing unit 1011 notifies the MBC 110 of the system board 100 of the value of the VALID flag, for example, “OFF” and the identification information of the component related to the event (Operation S85). Accordingly, the buffer management unit 112 of the MBC 110 receives the value of the VALID flag and the identification information of the component related to the event from the processing unit 1011 and stores the value of the VALID flag in an area of the command I/F area 121 relevant for the component related to the event. The process is ended. It may be possible to reduce the retransmission of an interrupt signal for the component related to the event.
By the process as described above, monitoring may be stopped appropriately in accordance with the occurrence of the event.
The manager of the information processing apparatus 1 may perform a setting of increasing the number of revolutions of the cooling fan in accordance with, for example, an increase of an outside air temperature.
Accordingly, the processing unit 1011 of the service processor 1000 notifies the MBC 110 of the system board 100 of the value of the VALID flag, for example, “OFF” and the identification information of the component, for example, the cooling fan (Operation S91 of
The processing unit 1011 generates a new threshold value according to the changed setting. When the number of revolutions of, for example, the cooling fan is changed from 1000 rpm (revolutions per minute) to 1500 rpm, the upper limit threshold value is changed from 1100 rpm to 1600 rpm and the lower limit threshold value is changed from 900 rpm to 1400 rpm. The processing unit 1011 notifies the MBC 110 of the system board 100 of the new threshold value (Operation S95). Accordingly, the buffer management unit 112 of the MBC 110 receives the threshold value and stores the threshold value in an area of the command I/F area 121 relevant for a target component, for example, a cooling fan (Operation S97).
After a certain time elapses, the processing unit 1011 notifies the MBC 110 of the system board 100 of the value of the VALID flag, for example, “ON” and the identification information of the component, for example, the cooling fan (Operation S99). Accordingly, the buffer management unit 112 of the MBC 110 receives the value of the VALID flag and the identification information of the component and stores the value of the VALID flag in an area of the command I/F area 121 relevant for a target component, for example, a cooling fan (Operation S101).
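The order of the threshold update (VALID flag "OFF", new threshold values, waiting, VALID flag "ON") can be sketched as below. The fixed margin of 100 rpm is inferred from the numerical example (1000 rpm giving 900 to 1100 rpm, and 1500 rpm giving 1400 to 1600 rpm) and is an assumption, as are the function names and the `settle` callback that stands in for the unspecified waiting time.

```python
def thresholds_for(target_rpm, margin_rpm=100):
    """Lower and upper limits as target +/- a fixed margin (margin is assumed)."""
    return target_rpm - margin_rpm, target_rpm + margin_rpm


def retarget(entry, new_target_rpm, settle=lambda: None):
    """Apply a new fan setting without raising spurious abnormality notifications."""
    entry["valid"] = False                                           # VALID "OFF" (Operation S91)
    entry["lower"], entry["upper"] = thresholds_for(new_target_rpm)  # Operations S95/S97
    settle()                                                         # wait for "a certain time"
    entry["valid"] = True                                            # VALID "ON" (Operations S99/S101)
    return entry


fan_entry = {"component_id": 103, "valid": True, "lower": 900, "upper": 1100}
retarget(fan_entry, 1500)
```

Disabling the VALID flag before the thresholds are rewritten means that readings taken while the fan is still changing speed are not compared against either the old or the new limits.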
The execution control unit 111 of the MBC 110 executes a monitoring process (Operation S103). The monitoring process may be the monitoring process illustrated in
When settings of, for example, hardware are changed by the process as described above, the threshold value for an abnormality detection may be dynamically changed and thus, the monitoring may be continued appropriately.
The configuration of the functional block of, for example, the service processor 1000 may not be coincident with the configuration of a program module.
Also, in a processing flow, a processing sequence may be changed and a parallel execution may be performed as long as the processing result is not changed.
When a secondary failure occurs, the process described above may be executed after the component that caused the failure is specified by employing, for example, a well-known technique. The replacement of a component which is not actually in a failure state may be reduced.
The information processing apparatus includes a processor, a module, and a controller. The processor transmits a condition for detecting the abnormality of the module to the controller. The controller acquires information from the module and determines whether the information acquired from the module satisfies the condition. When the information acquired from the module satisfies the condition, the controller transmits the information indicating that the abnormality of the module is detected to the processor.
The processor is notified only when the abnormality is detected. Further, the controller executes simple processing suitable for the controller. The processing load of the processor is reduced and thus, high-speed processing may be achieved in the processing as a whole.
The information processing apparatus may also include a storage device. The controller stores the information acquired from the module in the storage device. When the information indicating that the abnormality of the module is detected is received from the controller, the processor reads the information, which is acquired from the module, from the storage device and determines whether the information acquired from the module satisfies the condition. When the information acquired from the module satisfies the condition, a processing to cope with the abnormality of the module may be executed. It may be confirmed whether there is an error in the abnormality detected by the controller. Since the processor confirms only the abnormality detected by the controller, an increase in the processing load of the processor may be reduced.
When the information acquired from the module satisfies the condition, the processor transmits a first request requesting to stop monitoring of the module to the controller. When the first request is received from the processor, the controller may stop the monitoring of the module. Repeated notification of the same abnormality of the module to the processor may be reduced.
The processor transmits the first request requesting to stop monitoring of the module and a second request requesting to change the condition to a second condition for detecting the abnormality of the module to the controller. When the first request and second request are received from the processor, the controller may stop monitoring of the module and change the condition to the second condition. Detecting the abnormality which does not need to be detected due to a condition change may be reduced.
The controller may transmit information indicating that the abnormality of the module is detected to the processor by an interrupt. The processor may rapidly start the process.
The processor transmits a condition for detecting the abnormality of the module to a controller which monitors the abnormality of the module. The controller acquires information from the module and determines whether the information acquired from the module satisfies the condition. When the information acquired from the module satisfies the condition, the controller transmits, to the processor, information indicating that the abnormality of the module is detected.
A program for causing the processor to perform the process described above may be created. The program may be stored in a computer-readable storage medium, such as, for example, a flexible disk, a CD-ROM, a magneto-optical disk, a semiconductor memory, or a hard disk, or in a storage device. An intermediate processing result may be temporarily stored in a storage device, for example, a main memory.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to an illustrating of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2014-243548 | Dec 2014 | JP | national |