This application is based upon and claims the benefit of the prior Japanese Patent Application No. 2018-215918, filed on Nov. 16, 2018, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to an information processing apparatus.
Prior to the operation of an information processing apparatus (computer), a POST (Power-On Self-Test) is typically performed by a BIOS (Basic Input/Output System). The POST is performed by executing a POST program, which is a test program, when the BIOS is booted, and includes a process of detecting and initializing each component in the information processing apparatus.
There is known a restart control system that automatically restarts an information processing apparatus when a failure occurs in the information processing apparatus (see, e.g., Japanese Laid-open Patent Publication No. 07-168729). There is also known a dynamic single clock trace method in a logic device operating in synchronization with a clock (see, e.g., Japanese Laid-open Patent Publication No. 01-131934).
Related techniques are disclosed in, for example, Japanese Laid-open Patent Publication Nos. 07-168729 and 01-131934.
According to an aspect of the embodiments, an information processing apparatus includes a memory in which a monitor program is stored, and a processor coupled to the memory and configured to execute the monitor program with a first amount of log information to be output during an execution of the monitor program, detect an occurrence of a failure while the monitor program is being executed with the first amount, change an amount of the log information from the first amount to a second amount larger than the first amount when the occurrence of the failure is detected while the monitor program is being executed with the first amount, execute the monitor program with the second amount, change the amount of the log information from the second amount to a third amount smaller than the second amount when the occurrence of the failure is not detected while the monitor program is being executed with the second amount, execute the monitor program with the third amount, and analyze the log information when the occurrence of the failure is detected while the monitor program is being executed with the second amount or executed with the third amount.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention, as claimed.
When the POST program hangs up at the time of booting the BIOS, a suspicious location of failure occurrence is identified by analyzing the BIOS log output during the execution of the POST program. However, when the BIOS log is insufficient, the identification accuracy of the suspicious location is lowered. This problem is not limited to the BIOS log output during the execution of the POST program; when a log output during the execution of any other program is analyzed, the identification accuracy of the suspicious location is likewise lowered when the log is insufficient.
Hereinafter, an embodiment of a technique of improving the identification accuracy of a suspicious location when a failure occurs during execution of a program in an information processing apparatus will be described in detail with reference to the drawings.
The CPU 101 operates as a log notification unit 111 and a POST code transmission unit 112 by executing a BIOS program when the information processing apparatus is powered on. At booting of the BIOS, the CPU 101 performs a POST by executing a POST program 113 including modules 114-1 to 114-N (N is an integer of 2 or more). As the modules 114-1 to 114-N, for example, the following ones are used.
(a) Memory initialization/test module
(b) CPU initialization/test module
(c) Chipset initialization/test module
(d) Legacy device initialization/test module
(e) Other device initialization/test module
(f) Data construction module
(g) RAS (Reliability Availability Serviceability) function initialization module
The memory initialization/test module is a module that initializes and tests a memory, and the CPU initialization/test module is a module that initializes and tests the CPU 101. The chipset initialization/test module is a module that initializes and tests a chipset. The legacy device initialization/test module is a module that initializes and tests a legacy device, and the other device initialization/test module is a module that initializes and tests other devices.
The data construction module is a module that constructs data such as an ACPI (Advanced Configuration and Power Interface) and an SMBIOS (System Management BIOS) which are used by an OS (Operating System). The RAS function initialization module is a module that initializes the RAS function.
The BMC 102 includes a BIOS log storage area 121, an event log storage area 122, a hang-up detection unit 123, and a POST code storage area 124, manages hardware included in the information processing apparatus, and monitors the operation of the information processing apparatus.
The log notification unit 111 transfers a BIOS log output during the execution of the POST program 113 to the BMC 102 via a serial port, and the BMC 102 stores the received BIOS log in the BIOS log storage area 121. The log notification unit 111 may change the setting of the serial port by changing a setting parameter 115 of the serial port.
When the POST program 113 hangs up due to a certain failure occurring during the execution of the POST program 113, the hang-up detection unit 123 detects a hang-up of the BIOS. Then, the hang-up detection unit 123 stores an event log indicating that the BIOS has hung up, in the event log storage area 122. A maintenance worker or a developer may check the event log stored in the event log storage area 122 through a user interface (UI) provided by the BMC 102, or the like.
The POST code transmission unit 112 transfers a POST code indicating the BIOS booting status to the BMC 102 at a point preset by the developer during the execution of the POST program 113. The POST code is a code indicating how far POST has been performed. In
The BMC 102 stores the received POST code in the POST code storage area 124. The POST code in the POST code storage area 124 is updated to the latest POST code as the POST progresses, and is used by the maintenance worker or the developer to identify a rough suspicious range when the BIOS is not normally booted due to failure occurrence.
For example, it is assumed that the BIOS hangs up while the POST code of the module 114-i remains in the POST code storage area 124. In this case, it may be seen that the POST code of the module 114-i has been successfully transmitted, but the POST code of the module 114-(i+1) has not been successfully transmitted. Therefore, it may be seen that the BIOS hangs up between the start of the execution of the module 114-i and the start of the execution of the module 114-(i+1).
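As a rough illustration of this reasoning, the following Python sketch maps the last POST code held by the BMC to the pair of modules between which the hang-up occurred. The POST code values, the table `POST_CODES`, and the function name `suspicious_range` are hypothetical and are not taken from the embodiment; the sketch assumes that one POST code is transmitted at the start of each module.

```python
# Hypothetical table: POST code sent at the start of each module, in execution order.
POST_CODES = [
    (0x10, "memory initialization/test"),
    (0x20, "CPU initialization/test"),
    (0x30, "chipset initialization/test"),
    (0x40, "legacy device initialization/test"),
]


def suspicious_range(last_code):
    """Return (module that started, next module whose POST code never arrived)."""
    for index, (code, name) in enumerate(POST_CODES):
        if code == last_code:
            if index + 1 < len(POST_CODES):
                return name, POST_CODES[index + 1][1]
            return name, None  # the hang-up occurred after the last module started
    raise ValueError(f"unknown POST code: {last_code:#x}")
```

The returned pair only bounds a rough suspicious range; it does not say which component inside that range failed.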
Next, the maintenance worker analyzes the various logs using a log analysis tool (operation 202), and determines whether a suspicious location may be identified by the log analysis tool (operation 203). When it is determined that the suspicious location may be identified (“YES” in operation 203), the maintenance worker displays the suspicious location using the log analysis tool (operation 205). In the meantime, when it is determined that the suspicious location may not be identified (“NO” in operation 203), the maintenance worker requests a developer of a development department to investigate (operation 204).
In the meantime, when the suspicious location may not be identified (“NO” in operation 302), the developer creates a BIOS program in which the BIOS log is enhanced to identify the suspicious location (operation 303). In this case, the developer may enhance the BIOS log by increasing the level of detail of the BIOS log and increasing the amount of information.
Next, the developer performs a reproduction test by causing the information processing apparatus to execute the BIOS program in which the BIOS log is enhanced, and collects the enhanced BIOS log (operation 304). Then, the developer manually analyzes the enhanced BIOS log (operation 305), and repeats the operations after operation 302. The operations of operation 302 to operation 305 are repeated until a suspicious location is identified.
Meanwhile, since the initialization of a high-speed device such as a USB (Universal Serial Bus) port is not completed when the BIOS is booted, the BIOS log is often output via a serial port. The transfer rate of the serial port is about 100 kbps, and the instruction execution speed of the CPU, represented by a clock frequency of about several GHz, is tens of thousands of times higher than the transfer speed of the serial port.
Therefore, the booting time of the BIOS depends on the time for which the BIOS log is transferred to the BMC via the serial port, and becomes longer in proportion to the information amount of the BIOS log to be output. Therefore, the BIOS is designed to output only the minimum BIOS log.
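The proportional relationship can be made concrete with a small calculation. The following Python sketch is only an estimate under stated assumptions: the figure of 100 kbps comes from the description above, whereas the framing of 10 bits per character and the two log sizes are hypothetical values chosen for illustration.

```python
SERIAL_BITS_PER_SECOND = 100_000  # "about 100 kbps" as stated in the description
BITS_PER_CHARACTER = 10  # assumption: 8 data bits plus one start and one stop bit


def transfer_seconds(log_characters):
    """Time needed to send a log of the given size through the serial port."""
    return log_characters * BITS_PER_CHARACTER / SERIAL_BITS_PER_SECOND


minimal = transfer_seconds(50_000)    # hypothetical minimum BIOS log
detailed = transfer_seconds(500_000)  # hypothetical enhanced BIOS log
```

Under these assumptions, a log ten times larger takes ten times longer to transfer, which is why the amount of log output at normal booting is kept small.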
However, when the BIOS hangs up, the minimum BIOS log alone may not contain enough information to identify the suspicious location. In addition, for example, even when the suspicious location can be narrowed down to a module from the minimum BIOS log, the amount of information of the BIOS log may be insufficient to identify which component related to that module is the cause, which may result in low accuracy of identification of the suspicious location. In this case, in order to clarify the root cause, the developer often creates the BIOS in which a BIOS log for identifying a suspicious location is enhanced, and performs a reproduction test.
The POST program 113 executed at the time of booting of the BIOS includes other device initialization/test modules. Examples of the other device initialization/test modules may include a PCI (Peripheral Component Interconnect) Bus Scan module which initializes and tests a PCI card. In the PCI Bus Scan module, the amount of information of BIOS log is previously adjusted to an initial value of a predetermined amount so that a large amount of BIOS log is not output.
(P1) The log analysis tool identifies, from the collected BIOS log, a part that has hung up during execution of the PCI Bus Scan module.
In the example of
(P2) The log analysis tool acquires the identification information of the PCI device from the BIOS log output last.
Information “Segment:0000, Bus:03, Device:0a, Function:00” is acquired as the identification information of the PCI device from the BIOS log on the last row of
(P3) The log analysis tool collates the acquired identification information of the PCI device with the configuration information of the information processing apparatus to narrow down the suspicious locations.
By collating the information “Segment:0000, Bus:03, Device:0a, Function:00” with the configuration information of the information processing apparatus, a mounting location of the PCI card which is the cause of the failure occurrence is identified.
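Procedures (P2) and (P3) can be sketched in Python as follows. Only the identification string format "Segment:…, Bus:…, Device:…, Function:…" is taken from the description; the surrounding log line layout, the `CONFIGURATION` table, the slot name, and the function names are hypothetical.

```python
import re

ID_PATTERN = re.compile(
    r"Segment:([0-9a-fA-F]+), Bus:([0-9a-fA-F]+), "
    r"Device:([0-9a-fA-F]+), Function:([0-9a-fA-F]+)"
)

# Hypothetical configuration information: PCI address -> mounting location.
CONFIGURATION = {
    (0x0000, 0x03, 0x0A, 0x00): "PCI slot 2",
}


def last_pci_address(bios_log_lines):
    """(P2) Take the PCI identification from the last log line that carries one."""
    for line in reversed(bios_log_lines):
        match = ID_PATTERN.search(line)
        if match:
            return tuple(int(field, 16) for field in match.groups())
    return None


def mounting_location(bios_log_lines):
    """(P3) Collate the address with the configuration information."""
    return CONFIGURATION.get(last_pci_address(bios_log_lines))
```

The lookup returns `None` when the log carries no identification or the address is not in the configuration information, in which case the suspicious location cannot be narrowed down by this method.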
However, in this method, even when the PCI card mounting location is identified, it is difficult to determine whether the PCI card itself is faulty or the PCI slot on which the PCI card is mounted is faulty, which may result in low accuracy of identification of the suspicious location. In order to identify whether the suspicious location is a PCI card or a PCI slot, it is desirable to refer to information stored in the register of each of the PCI card and the PCI slot (register information).
When the amount of information is increased by adding register information to the BIOS log, it is possible to identify whether the suspicious location is the PCI card or the PCI slot, which improves the accuracy of identification of the suspicious location. However, as the amount of information in the BIOS log is increased, the BIOS booting time becomes longer.
When the detection unit 513 detects the occurrence of a failure while the program processing unit 512 re-executes the monitoring target program 521 in which the amount of information of log is set to the second setting value, the analysis unit 515 analyzes a log output from the monitoring target program 521 (operation 604).
When the occurrence of a failure is not detected by the execution timing of the monitoring target program 521 while the program processing unit 512 re-executes the monitoring target program 521 in which the information amount of log is set to the second setting value, the controller 514 sets the amount of information of log to a third setting value which is smaller than the second setting value (operation 603). Then, the controller 514 instructs the program processing unit 512 to re-execute the monitoring target program 521.
When the detection unit 513 detects the occurrence of a failure while the program processing unit 512 re-executes the monitoring target program 521 in which the amount of information of log is set to the third setting value, the analysis unit 515 analyzes a log output from the monitoring target program 521 (operation 604).
According to the information processing apparatus 501 of
The extension devices 715-1 to 715-M are, for example, extension cards, and are mounted in the extension slots 714-1 to 714-M, respectively. The external storage device 716 is connected to the extension device 715-2. The BMC 719 is connected to the interface 717 and the serial port 718.
The memory 712 is, for example, a semiconductor memory such as a RAM (Random Access Memory). The nonvolatile memory 713 corresponds to the storage unit 511 in
The extension devices 715-j (j=1, 3 to M) are a video card, a sound card, a network interface, a storage interface, and the like. The external storage device 716 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, a tape device, or the like. The external storage device 716 may be a hard disk drive. The memory 712, the nonvolatile memory 713, and the external storage device 716 are computer-readable and physical (non-transitory) recording media.
The BMC 719 is a control device that manages hardware included in the information processing apparatus 701 and monitors the operation of the information processing apparatus 701. The hardware included in the information processing apparatus 701 corresponds to, for example, a system board of a server or the like. The interface 717 and the serial port 718 are communication interfaces, and the CPU 711 communicates with the BMC 719 via the interface 717 and the serial port 718.
The memory 812 is, for example, a semiconductor memory such as a RAM. The nonvolatile memory 813 is a semiconductor memory such as a ROM, a flash memory, or the like, and stores a BMC image 821 including a BMC program. The CPU 811 operates as the detection unit 513, the controller 514, and the analysis unit 515 in
The interface 814 and the serial port 815 are communication interfaces, and the CPU 811 communicates with the CPU 711 via the interface 814 and the serial port 815.
When a hang-up of the BIOS is detected during execution of the POST program 113, the BIOS is rebooted by the BMC 719, and the BIOS and the BMC 719 perform a diagnosis process to identify a suspicious location of failure occurrence. In the diagnosis process, the information amount of BIOS log output from each module 114-i (i=1 to N) is changed and the POST program 113 is re-executed.
A diagnosis start level 911 is an index indicating the amount of information of BIOS log output from each module 114-i in the diagnosis process. The diagnosis start level 911 is stored, for example, in the nonvolatile memory 713 of
When the diagnosis process is normally ended, the end notification unit 912 notifies the BMC 719 of the normal end via the interface 717. The monitoring unit 913 monitors the execution status of each module 114-i while the POST program 113 is being executed, and sets the diagnosis start level 911.
The log controller 914 performs a process of thinning out the BIOS log output during the execution of the POST program 113 according to the information acquired from the BMC 719. The log notification unit 915 transfers the BIOS log output during the execution of the POST program 113 to the BMC 719 via the serial port 718. The log notification unit 915 may change the setting of the serial port 718 by changing a setting parameter 917. The POST code transmission unit 916 transfers a POST code to the BMC 719 via the interface 717 at a preset location during the execution of the POST program 113.
The setting completion flag 1011 indicates whether the setting parameter 917 in
The hang-up location information 1012 indicates a failure occurrence location of the POST program 113 when a BIOS hang-up is detected. An example of the hang-up location information 1012 may include identification information of the module 114-i, a POST code, or the like when the BIOS hang-up is detected.
The diagnosis level 1013 is an index indicating a setting value of the information amount of BIOS log output from each module 114-i in the diagnosis process. For example, an integer in the range of 0 to “MAX” is used as the diagnosis level 1013. The symbol “MAX” indicates an integer of 1 or more and represents the maximum value of the diagnosis level 1013. At normal booting of the BIOS, the diagnosis level 1013 is set to 0, which is an initial value.
As the value of the diagnosis level 1013 becomes larger, the level of detail of the BIOS log becomes higher and the amount of information increases. For example, when the diagnosis level 1013 is 0, a BIOS log of the amount of information of an initial value is output. When the diagnosis level 1013 is 1 or more, a BIOS log more detailed than the initial value is output. The level of detail of the BIOS log may be enhanced by including the register information acquired from the register of each component in the information processing apparatus 701 in the BIOS log or increasing the number of pieces of register information.
For example, the amount of information of BIOS log with the diagnosis level 1013 of 0 corresponds to the first setting value, the amount of information of BIOS log with the diagnosis level 1013 of MAX corresponds to the second setting value, and the amount of information of BIOS log with the diagnosis level 1013 of 1 to MAX−1 corresponds to the third setting value.
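The correspondence between the diagnosis level 1013 and the three setting values can be written as a small Python function. The function name `setting_value` and the returned labels are hypothetical; the mapping itself follows the description above.

```python
def setting_value(level, max_level):
    """Map a diagnosis level (0 to MAX) to the setting value it corresponds to."""
    if not 0 <= level <= max_level:
        raise ValueError("diagnosis level must be in the range 0 to MAX")
    if level == 0:
        return "first"   # amount of information at normal booting
    if level == max_level:
        return "second"  # largest amount of information
    return "third"       # intermediate amounts, levels 1 to MAX-1
```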
The module 114-i in
For example, at normal booting of the BIOS, during execution of the PCI Bus Scan module included in the POST program 113, the BIOS log as illustrated in
In this manner, as the value of the diagnosis level 1013 becomes larger, the amount of information may increase by increasing the type of information included in the BIOS log by adding register identification information or adding the register information stored in the register.
In addition, the developer may determine the kind of information to be included in each of the BIOS logs with the diagnosis levels 1013 of 0 to MAX. For example, as the value of the diagnosis level 1013 becomes larger, the number of registers to be acquired for acquiring register information among the registers included in each component may be increased. In addition, for only a specified component, as the value of the diagnosis level 1013 becomes larger, the number of registers to be acquired may be increased.
When a hang-up of the BIOS is detected, the register information up to one of the row numbers “3002” to “3017” in
In addition, when the BIOS hangs up due to an unexpected value stored in the register, the cause may be removed by replacing the PCI card or the PCI slot, but there may be a problem with the BIOS itself. In this case, by analyzing the BIOS log including the register information, since the failure occurrence location and the cause of the failure occurrence may be identified more accurately, it is possible to determine the necessity of BIOS correction.
The diagnosis level 1013 is set as the diagnosis start level 911 in
The BMC 719 includes a diagnosis log storage area 1015, a BIOS log storage area 1016, an event log storage area 1017, and a POST code storage area 1021. These storage areas are formed in, for example, the memory 812 in
The diagnosis log storage area 1015 and the BIOS log storage area 1016 store BIOS logs received from the log notification unit 915 in
The CPU 811 operates as a switching unit 1018, a log analysis unit 1019, a hang-up detection unit 1020, a hang-up location analysis unit 1022, and a determination unit 1023 by executing a BMC program. The hang-up detection unit 1020 and the log analysis unit 1019 correspond to the detection unit 513 and the analysis unit 515 in
When the POST program 113 hangs up, the hang-up detection unit 1020 detects hang-up of the BIOS, and stores an event log indicating that the BIOS has hung up, in the event log storage area 1017.
For example, the hang-up detection unit 1020 has a function of a watchdog timer, and the BIOS sets a predetermined time in the watchdog timer and causes the watchdog timer to start counting at the start of POST. Then, the BIOS periodically resets the watchdog timer during execution of the POST. When the predetermined time elapses after the watchdog timer was last reset without the watchdog timer being reset again, the watchdog timer times out. Therefore, the hang-up detection unit 1020 may detect hang-up of the BIOS by detecting the timeout of the watchdog timer.
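The watchdog behavior described above can be sketched in Python as follows. This is a simplified model and not the BMC implementation: the class name `WatchdogTimer` and its methods are hypothetical, and time is passed in explicitly as a number so that the example is deterministic.

```python
class WatchdogTimer:
    """Minimal model of a watchdog timer that is reset periodically by the BIOS."""

    def __init__(self, timeout):
        self.timeout = timeout  # the predetermined time set by the BIOS
        self.last_reset = None  # None until counting has started

    def start(self, now):
        """Start counting at the start of POST."""
        self.last_reset = now

    def reset(self, now):
        """Called periodically by the BIOS while POST is making progress."""
        self.last_reset = now

    def timed_out(self, now):
        """True when the predetermined time has elapsed since the last reset."""
        return self.last_reset is not None and now - self.last_reset >= self.timeout
```

For example, with a timeout of 5 and a last reset at time 3, the timer has not timed out at time 7 but has timed out at time 8, which would be treated as a hang-up of the BIOS.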
The hang-up location analysis unit 1022 analyzes the BIOS log stored in the BIOS log storage area 1016 to identify a failure occurrence location, and generates hang-up location information 1012 indicating the identified failure occurrence location.
When the hang-up of the BIOS is detected, the determination unit 1023 determines whether the detected hang-up is a hang-up that occurs during the normal booting of the BIOS or a hang-up that recurs during the diagnosis process. In the meantime, when the hang-up of the BIOS is not detected, the determination unit 1023 determines whether the BIOS has been normally booted, based on the end flag 1014.
When a hang-up is detected during the normal booting of the BIOS, the switching unit 1018 changes the setting value of the amount of information of BIOS log by changing the diagnosis level 1013 from 0 to MAX, and instructs the CPU 711 to reboot the BIOS.
When a hang-up is not detected during the rebooting of the BIOS when the diagnosis level 1013 is MAX, the switching unit 1018 gradually decreases the amount of information of BIOS log by decrementing the diagnosis level 1013 by one from MAX to 1. Then, the switching unit 1018 instructs the CPU 711 to reboot the BIOS at each stage where the diagnosis level 1013 is set to a value in the range of MAX−1 to 1.
When a hang-up is detected during the rebooting of the BIOS in a state where the diagnosis level 1013 is set to any value of MAX to 1, the log analysis unit 1019 identifies a suspicious location by analyzing the diagnosis log stored in the diagnosis log storage area 1015.
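The order in which the diagnosis level 1013 is tried after a hang-up at normal booting can be sketched in Python as follows. The function names and the callback `hangs_at`, which stands in for rebooting the BIOS at a given level and observing whether it hangs up, are hypothetical.

```python
def diagnosis_levels(max_level):
    """Levels tried after a hang-up at level 0: MAX, MAX-1, ..., 1."""
    return list(range(max_level, 0, -1))


def run_diagnosis(max_level, hangs_at):
    """Reboot at each level in turn; return the level at which the hang-up recurs.

    hangs_at(level) is a stand-in for one reboot and returns True on a hang-up.
    Returns None when the hang-up does not recur at any level.
    """
    for level in diagnosis_levels(max_level):
        if hangs_at(level):
            return level  # the diagnosis log collected at this level is analyzed
    return None
```

Starting from MAX means that, whenever the hang-up recurs, the collected log is the most detailed one with which the failure can still be reproduced.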
According to the information processing apparatus 701 of
In the diagnosis process, first, the amount of information of the BIOS log is set to the maximum, and the most detailed BIOS log is collected. At this time, even when the first failure is not reproduced due to a timing failure, the operation of the information processing apparatus 701 approaches the operation at the time of failure occurrence by repeating the rebooting while gradually decreasing the amount of information of the BIOS log. Therefore, the failure is likely to be reproduced at one of the stages, and the BIOS log more detailed than that at the time of first booting is collected, which makes it possible to identify a suspicious location with high accuracy.
In addition, by analyzing the detailed BIOS log by the BMC 719, it is possible to automatically identify a suspicious location without intervention of a maintenance worker or a developer.
A part or all of the end notification unit 912, the monitoring unit 913, and the log controller 914 in
Further, a part or all of the switching unit 1018, the log analysis unit 1019, the hang-up location analysis unit 1022, and the determination unit 1023 in
When the diagnosis level 1013 is 0, it is determined that the hang-up occurred during the normal booting of the BIOS. When the diagnosis level 1013 is 1 or more, it is determined that the hang-up recurred during the diagnosis process.
When the diagnosis level 1013 is 0 (“YES” in operation 1302), the hang-up location analysis unit 1022 is activated to perform a hang-up location analysis process (operation 1304), and the switching unit 1018 performs a switching process (operation 1305). In the meantime, when the diagnosis level 1013 is 1 or more (“NO” in operation 1302), the log analysis unit 1019 is activated to perform a log analysis process (operation 1303).
When a hang-up of the BIOS has not been detected (“NO” in operation 1301), the determination unit 1023 checks the value of the diagnosis level 1013 (operation 1306). When the diagnosis level 1013 is 0 (“YES” in operation 1306), the hang-up detection unit 1020 repeats the process of operation 1301.
In the meantime, when the diagnosis level 1013 is 1 or more (“NO” in operation 1306), the determination unit 1023 checks the value of the end flag 1014 (operation 1307). When the end flag 1014 is logic “0” (“NO” in operation 1307), the hang-up detection unit 1020 repeats the process of operation 1301.
In the meantime, when the end flag 1014 is logic “1” (“YES” in operation 1307), the determination unit 1023 determines that the hang-up of the BIOS has not recurred in the diagnosis process. Therefore, the determination unit 1023 checks the value of the diagnosis level 1013 (operation 1308). When the diagnosis level 1013 is 2 or more (“NO” in operation 1308), the switching unit 1018 performs a switching process (operation 1305).
In the meantime, when the diagnosis level 1013 is 1 (“YES” in operation 1308), the determination unit 1023 determines that the amount of information of the BIOS log has reached a predetermined amount by gradual decrease. Therefore, the determination unit 1023 checks the value of the setting completion flag 1011 (operation 1309). When the setting completion flag 1011 is logic “0” (“NO” in operation 1309), the determination unit 1023 instructs the log controller 914 in
When instructed by the determination unit 1023 to reduce the BIOS log, the log controller 914 reduces the amount of information of the BIOS log output to the serial port 718 by thinning out the BIOS log output during execution of the next POST program 113. For example, the log controller 914 may reduce the amount of information of a log by thinning out a text of the BIOS log so that the text of the BIOS log is output at intervals of K characters (K is an integer of 1 or more).
As a result, since the time for which the BIOS log is transferred to the BMC 719 via the serial port 718 is reduced, the operation of the information processing apparatus 701 approaches the operation at the time of failure occurrence, which leads to a high possibility of reproduction of the failure. When the failure is reproduced, a thinned-out BIOS log is collected.
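The thinning-out of the BIOS log text can be sketched in Python as follows. The description does not fix the exact meaning of "at intervals of K characters"; this sketch assumes that one character is output and the next K characters are skipped, so that any K of 1 or more shortens the text. The function name `thin_out` is hypothetical.

```python
def thin_out(text, k):
    """Output one character, then skip k characters (assumed reading of the text)."""
    if k < 1:
        raise ValueError("K must be an integer of 1 or more")
    return text[::k + 1]
```

Under this assumption, `thin_out("abcdefgh", 1)` yields `"aceg"`, which halves the number of characters sent through the serial port.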
In
The BIOS log in
When the setting completion flag 1011 is logic “1” (“YES” in operation 1309), the determination unit 1023 determines that the failure is not reproduced even when the BIOS log is thinned out. Therefore, the determination unit 1023 stores an event log indicating that the failure has not been reproduced, in the event log storage area 1017, and ends the process.
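The branching of operations 1301 to 1309 can be summarized in a single Python function. This is a sketch of the decision logic only: the function name `next_action` and the returned action labels are hypothetical, and the assumption that the switching process follows the instruction to thin out the log is inferred from the surrounding description rather than stated in this passage.

```python
def next_action(hang_up, level, end_flag, setting_done):
    """Return the action the BMC takes for the current state.

    hang_up: True when a hang-up of the BIOS was detected (operation 1301).
    level: current value of the diagnosis level.
    end_flag, setting_done: the end flag and the setting completion flag.
    """
    if hang_up:
        if level == 0:  # operation 1302: hang-up at normal booting
            return "analyze hang-up location, then switch"
        return "analyze diagnosis log"  # hang-up recurred during diagnosis
    if level == 0:  # operation 1306: normal booting, nothing to do yet
        return "keep monitoring"
    if not end_flag:  # operation 1307: the BIOS has not finished booting
        return "keep monitoring"
    if level >= 2:  # operation 1308: not reproduced, lower the level
        return "switch"
    if not setting_done:  # operation 1309: level is 1, try a thinned-out log
        return "thin out log, then switch"  # assumed ordering
    return "record failure not reproduced"
```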
In addition, the log controller 914 may repeat the process of thinning out the BIOS log a plurality of times instead of only once. In this case, the text of the BIOS log to be output decreases gradually, such as at intervals of K characters, at intervals of (K+1) characters, or at intervals of (K+2) characters. Further, the log controller 914 may adjust the transfer time of the serial port 718 more finely by also changing the baud rate of the serial port 718.
When the identification information of the hung-up module 114-i is identified (“YES” in operation 1502), the hang-up location analysis unit 1022 generates hang-up location information 1012 indicating the identification information of the module 114-i (operation 1503).
In the meantime, when the identification information of the hung-up module 114-i is not identified (“NO” in operation 1502), the hang-up location analysis unit 1022 acquires the POST code stored immediately before the hang-up, from the POST code storage area 1021. Then, the hang-up location analysis unit 1022 generates hang-up location information 1012 indicating the acquired POST code (operation 1504).
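The fallback from module identification information to the POST code can be sketched in Python as follows. The function name `hang_up_location` and the tuple representation of the hang-up location information 1012 are hypothetical.

```python
def hang_up_location(module_id, last_post_code):
    """Build hang-up location information from whatever could be identified.

    module_id is None when the hung-up module could not be identified
    from the BIOS log ("NO" in operation 1502).
    """
    if module_id is not None:
        return ("module", module_id)  # operation 1503
    return ("post_code", last_post_code)  # operation 1504
```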
When the switching process is performed following the process of operation 1304, the diagnosis level 1013 is changed from 0 to MAX in operation 1601. The value of MAX may be a value common to the modules 114-1 to 114-N, or may be different for each hung-up module 114-i.
When the switching process is performed following the process of operation 1308, the diagnosis level 1013 is decremented by 1 in operation 1601. When the switching process is performed following the process of operation 1310, the diagnosis level 1013 is set to 1 in operation 1601.
Next, the log analysis unit 1019 stores an event log in the event log storage area 1017 (operation 1703). When the suspicious location is identified, an event log indicating that the suspicious location is identified is stored in the event log storage area 1017. When the suspicious location is not identified, an event log indicating that the suspicious location is not identified is stored in the event log storage area 1017.
Next, the log analysis unit 1019 erases the hang-up location information 1012 (operation 1704), and initializes the diagnosis level 1013 by changing the diagnosis level 1013 to 0 (operation 1705).
Next, the monitoring unit 913 performs a diagnosis start level setting process (operation 1803), and the CPU 711 executes the module 114-i of the POST program 113 (operation 1804). The modules 114-1 to 114-N are sequentially executed from the module 114-1, and the next module 114-i is executed each time the process of operation 1804 is repeated.
Next, the CPU 711 checks the value of the diagnosis start level 911 (operation 1805). When the diagnosis start level 911 is 0 (“YES” in operation 1805), the CPU 711 causes the executed module 114-i to output the BIOS log of the information amount of the initial value (operation 1807). In this case, the log notification unit 915 transfers the BIOS log with the BIOS diagnosis level 1013 of 0 to the BMC 719 via the serial port 718.
In the meantime, when the diagnosis start level 911 is not 0 (“NO” in operation 1805), the CPU 711 causes the executed module 114-i to output the BIOS log of the information amount according to the diagnosis start level 911 (operation 1806). In this case, the log notification unit 915 transfers the BIOS log with the BIOS diagnosis level 1013 of any of MAX to 1 to the BMC 719 via the serial port 718.
Next, the POST code transmission unit 916 transfers a POST code to the BMC 719 via the interface 717 (operation 1808). Then, the monitoring unit 913 checks whether the BIOS has been booted normally (operation 1809). When the BIOS has not been booted normally (“NO” in operation 1809), the CPU 711 repeats the processes after operation 1803.
When the BIOS has been booted normally (“YES” in operation 1809), the monitoring unit 913 instructs the BMC 719 to change the end flag 1014 (operation 1810), and the determination unit 1023 in
When the hang-up location information 1012 indicates identification information of any module 114-p (p=1 to N) (“NO” in operation 1902), the monitoring unit 913 acquires identification information of a module 114-i to be executed next (operation 1905). Then, the monitoring unit 913 compares the identification information of the module 114-i with the identification information of the module 114-p (operation 1906).
When the identification information of the module 114-i is equal to the identification information of the module 114-p (“YES” in operation 1906), the monitoring unit 913 acquires the diagnosis level 1013 from the BMC 719 via the interface 717 (operation 1907). Then, the monitoring unit 913 sets the value of the acquired diagnosis level 1013 as the diagnosis start level 911 (operation 1908). In the meantime, when the identification information of the module 114-i is different from the identification information of the module 114-p (“NO” in operation 1906), the monitoring unit 913 ends the process.
When the hang-up location information 1012 indicates the POST code (“YES” in operation 1902), the monitoring unit 913 acquires the POST code last transferred by the POST code transmission unit 916 (operation 1903). Then, the monitoring unit 913 compares the last transferred POST code with the POST code indicated by the hang-up location information 1012 (operation 1904).
When the last transferred POST code is equal to the POST code indicated by the hang-up location information 1012 (“YES” in operation 1904), the monitoring unit 913 performs operation 1907 and the subsequent operations. In the meantime, when the last transferred POST code is different from the POST code indicated by the hang-up location information 1012 (“NO” in operation 1904), the monitoring unit 913 ends the process.
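The branching of operations 1902 to 1908 may be illustrated by the following sketch. The function name, the parameter names, and the encoding of the hang-up location information 1012 (an integer for a POST code, a string for a module identifier) are assumptions made for the example. Leaving the level unchanged when the process ends without a match is likewise an assumption.

```python
from typing import Optional, Union


def set_diagnosis_start_level(
    hangup_location: Union[int, str],  # POST code (int) or module id (str); assumed encoding
    next_module_id: str,
    last_post_code: Optional[int],
    bmc_diagnosis_level: int,
    current_start_level: int,
) -> int:
    """Return the diagnosis start level to use for the next module."""
    if isinstance(hangup_location, int):             # "YES" in operation 1902
        # operations 1903 and 1904: compare with the last transferred POST code
        matched = last_post_code == hangup_location
    else:                                            # "NO" in operation 1902
        # operations 1905 and 1906: compare with the module to be executed next
        matched = next_module_id == hangup_location
    if matched:
        # operations 1907 and 1908: adopt the diagnosis level held by the BMC
        return bmc_diagnosis_level
    return current_start_level                       # process ends, level unchanged (assumed)
```

For example, `set_diagnosis_start_level("114-2", "114-2", None, 3, 0)` returns 3, whereas the same call with `next_module_id="114-1"` returns 0.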
According to the diagnosis start level setting process of
For example, when the diagnosis level 1013 is MAX, the CPU 711 adjusts the information amount of the BIOS log output at the failure occurrence location to the information amount corresponding to the diagnosis level 1013 of MAX. In addition, as the diagnosis level 1013 is decremented by one at a time from MAX to 1, the CPU 711 adjusts, at each stage, the information amount of the BIOS log output at the failure occurrence location to the information amount corresponding to the diagnosis level 1013 of MAX−1 down to 1. As a result, the information amount of the BIOS log at the failure occurrence location is adjusted in accordance with the diagnosis level 1013.
According to the switching control process of
As a result, the amount of information of the BIOS log decreases gradually and the operation of the information processing apparatus 701 approaches the operation at the time of failure occurrence, which increases the possibility of reproduction of the failure. When the failure is reproduced, a more detailed BIOS log than that at the first booting is collected, which makes it possible to identify a suspicious location with high accuracy.
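The gradual decrease of the information amount across reboots may be sketched as follows. The value of `MAX_LEVEL`, the function names, and the `fails_at` predicate are hypothetical. Holding the level at 1 when no failure occurs at level 1 is an assumption of this sketch and is not stated in the description.

```python
from typing import Callable, List, Tuple

MAX_LEVEL = 3  # assumed numeric value of MAX


def next_diagnosis_level(level: int, failed: bool) -> Tuple[int, bool]:
    """Return (level for the next boot, whether the collected log is analyzed)."""
    if level == 0:
        # First amount: a failure raises the level to MAX for the next boot.
        return (MAX_LEVEL, False) if failed else (0, False)
    if failed:
        # Failure reproduced with a more detailed log: analyze it.
        return level, True
    # Normal boot at a detailed level: reduce the information amount by one.
    return max(level - 1, 1), False


def reproduce(fails_at: Callable[[int], bool], max_boots: int = 10) -> List[Tuple[int, bool]]:
    """Simulate successive boots and return the (level, failed) history."""
    level = 0
    history: List[Tuple[int, bool]] = []
    for _ in range(max_boots):
        failed = fails_at(level)
        history.append((level, failed))
        level, analyze = next_diagnosis_level(level, failed)
        if analyze:
            break
    return history


# Example: a timing-sensitive failure that only appears at level 1 or below.
history = reproduce(lambda level: level <= 1)
```

In this example the failure disappears at levels 3 and 2 and reappears at level 1, at which point a log more detailed than that of the first booting is available for analysis.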
When the thinned-out BIOS log is collected by performing the control of operation 1310 of
At this time, the developer compares the BIOS log collected with the diagnosis level 1013 of 1 when the BIOS was booted normally with the thinned-out BIOS log collected when the BIOS hung up. Then, the developer supplements the thinned-out portions by associating the two BIOS logs with each other.
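The association of the two BIOS logs may be illustrated by the following sketch. It assumes, purely for illustration, that each log is a list of entries that can be matched by exact string comparison and that the normal-boot log contains every entry in execution order. The function name `supplement` and the entry labels are hypothetical, and actual BIOS logs would likely require normalization (for example, of timestamps) before matching.

```python
from typing import List, Tuple


def supplement(thinned: List[str], reference: List[str]) -> List[Tuple[str, bool]]:
    """Rebuild the hang-up log from the normal-boot log.

    Returns the reference entries up to the last one that the hang-up boot
    also recorded, each paired with True if it was actually recorded and
    False if it was filled in from the reference log.
    """
    recorded = set(thinned)
    # Index of the last reference entry that also appears in the thinned-out log.
    last = max((i for i, entry in enumerate(reference) if entry in recorded), default=-1)
    return [(entry, entry in recorded) for entry in reference[: last + 1]]


# Example with made-up entries: "A2" was thinned out, and the hang-up
# occurred somewhere after "B1".
rebuilt = supplement(["A1", "B1"], ["A1", "A2", "B1", "B2", "C1"])
```

Entries paired with `False` are inferred rather than observed, so the developer would treat them as a guide to where the hang-up boot had progressed, not as recorded evidence.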
As illustrated in
When the suspicious location can be identified (“YES” in operation 2002), the developer ends the analysis operation. In the meantime, when the suspicious location cannot be identified (“NO” in operation 2002), the developer creates a BIOS program in which the BIOS log is enhanced (operation 2003).
Next, the developer performs a reproduction test by causing the CPU 711 to execute the BIOS program in which the BIOS log is enhanced, and collects the enhanced BIOS log (operation 2004). Then, the developer manually analyzes the enhanced BIOS log (operation 2005) and identifies a suspicious location (operation 2006).
When the suspicious location can be identified in operation 2002, operations 2003 to 2006 become unnecessary and the analysis operation ends immediately. Even when the suspicious location cannot be identified in operation 2002, the suspicious location may be roughly estimated since more detailed information than the BIOS log with the diagnosis level 1013 of 0 is acquired.
Therefore, by performing the reproduction test in which the BIOS log is enhanced only once, there is a high possibility that the suspicious location may be identified, and it is not necessary to repeat the reproduction test a plurality of times as in the investigation operation of
The CPU 101 and the BMC 102 in
The configuration of the information processing apparatus illustrated in
The configurations of the CPU 711 in
The flowcharts of
The BIOS logs illustrated in
While the disclosed embodiments and the advantages thereof have been described in detail, it should be understood by those skilled in the art that various changes, additions, and omissions may be made without departing from the spirit and scope of the present disclosure as set forth in the claims.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the invention and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2018-215918 | Nov 2018 | JP | national |