An embodiment of the invention relates to generating a core dump on a platform with multiple partitions.
A core dump represents a snapshot of a computer system at a specific time. When a problem occurs in the computer system, analyzing a core dump is a useful method of determining the causes of the problem. The core dump is generally used to debug a program or a system that has terminated abnormally, for example, in a system crash. The core dump typically refers to a file containing a memory image of a particular process, or the memory images of parts of the address space of that process, that is, the complete, unstructured state of the dumped memory regions.
The core dump provides information such as the memory usage or the processes running at the time the problem arises in the computer system. The method of troubleshooting using the core dump may be described in two general steps. First, a core dump is generated. Second, the core dump is either stored on a specific memory space managed by the core dump device or the core dump is transferred out of the computer system to be analyzed.
Generally, a dumping device driver is installed on a computer system and managed by an operating system running on that computer system. When a problem occurs, the dumping device driver gathers information on the computer system and generates a core dump. More specifically, the core dump is related to the operating system and the processes running on that operating system at the time the system failure occurs. When a core dump is generated, it is usually stored in a memory space allocated for that operating system.
When a problem occurs on a computer system, the dumping device may be corrupted by the same problem that causes the computer system failure. The corrupted dumping device may generate unreliable kernel images, such as tainted kernel images, or no images at all. Examples of a tainted kernel image include a partial kernel image or a kernel image that contains incorrect core dump information. A tainted kernel image or a complete lack of a kernel image does not assist in troubleshooting a problematic computer system.
Another method of obtaining a core dump is to use network-based dump tools. This method uses a dumping device that resides remotely on a system different from the problem system. When a problem occurs on a computer system and requires a core dump, a remote dumping device may not be corrupted. Therefore, a remote dumping device may generate a more reliable core dump than a dumping device residing on the problem system itself.
However, depending on the problem system, the size of a core dump may be extremely large. For example, a core dump of a high end server may require 16 GB of storage space. Bandwidth may be an issue when transferring a core dump of this size over a network off the problem system. In addition, the network may not be reliable enough to transmit the core dump of this size.
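To give a rough sense of scale, the following sketch estimates the ideal time needed to move a 16 GB core dump off the problem system. The link speeds and the helper name `transfer_seconds` are illustrative assumptions and do not come from this disclosure; the estimate also ignores protocol overhead and retransmissions, so real transfers would take longer.

```python
GIB = 2**30  # bytes per GiB; assumes "16 GB" means 16 * 2**30 bytes


def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and retransmissions."""
    return size_bytes * 8 / link_bits_per_second


dump_size = 16 * GIB

# Hypothetical link speeds, chosen only for illustration.
for label, bps in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)):
    minutes = transfer_seconds(dump_size, bps) / 60
    print(f"{label}: {minutes:.1f} minutes")
```

Under these assumptions the transfer occupies a 100 Mbit/s link for roughly 23 minutes, during which any network interruption may force the transfer to restart.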
Various embodiments are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an,” “one,” or “various” embodiments in this disclosure are not necessarily to the same embodiment, and such references mean at least one.
A method for providing a reliable kernel core dump on a multi-core platform is described herein. A person of ordinary skill in the pertinent art, upon reading the present disclosure, will recognize that various novel aspects and features of the present invention can be implemented independently or in any suitable combination, and further, that the disclosed embodiments are merely illustrative and not meant to be limiting.
During a computer system boot up process, each instance of an operating system is loaded into a partition of the main system memory 120. As shown in
The partitioning of these separate memory spaces may be done by a firmware 140. In one embodiment of the invention, the firmware may be stored in a basic input/output system (BIOS). The BIOS is generally responsible for initializing and configuring system hardware and software resources.
An example of the firmware 140 is the PRL firmware currently used by the Intel™ 915G chipset. PRL firmware is a modified version of Tiano™ firmware, which is an example of an implementation of the Extensible Firmware Interface (EFI). The firmware 140, such as the PRL firmware, divides the system resources during the boot phase. Such division of memory space may be referred to as “soft partitioning.”
Dividing main system memory into multiple partitions for multiple operating systems may include allocating memory space to be used by the corresponding operating system (operation 160). The allocation of separate memory space may be accomplished pursuant to the soft partitioning process 154 (e.g. operation 160) or the allocation may be accomplished during the soft partitioning process 154. Furthermore, a shared memory may be allocated to be accessible by the multiple operating systems (operation 162).
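The allocation described above can be sketched as a simple memory map builder. This is a minimal illustration only: the `Region` type, the `soft_partition` function, the partition names, and all sizes are hypothetical and are not taken from the firmware 140 or from any real firmware interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    base: int  # first byte address of the region
    size: int  # length of the region in bytes

    @property
    def end(self) -> int:
        return self.base + self.size  # exclusive end address


def soft_partition(total_bytes, os_sizes, shared_bytes):
    """Lay out one private region per operating system, then a shared region."""
    regions, cursor = [], 0
    # Private memory space for each operating system (compare operation 160).
    for name, size in os_sizes.items():
        regions.append(Region(name, cursor, size))
        cursor += size
    # Shared memory accessible by all operating systems (compare operation 162).
    regions.append(Region("shared", cursor, shared_bytes))
    cursor += shared_bytes
    if cursor > total_bytes:
        raise ValueError("requested partitions exceed main system memory")
    return regions


MIB = 2**20
# Hypothetical sizes for a 1 GiB main system memory.
layout = soft_partition(
    total_bytes=1024 * MIB,
    os_sizes={"main": 768 * MIB, "sequestered": 192 * MIB},
    shared_bytes=64 * MIB,
)
for region in layout:
    print(f"{region.name:12s} base={region.base:#012x} size={region.size // MIB} MiB")
```

The regions are laid out back to back so that no private region overlaps another, while the shared region sits at a known address that each operating system can map.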
In one embodiment of the invention, each thread maintains an advanced configuration and power interface (ACPI) table. Each table includes a list of resources that will be initiated, configured and maintained by each thread. As shown in
After a core dump is generated, the module 405 stores the core dump in a shared memory 430. As described above in
An interrupt handler 407 may be installed as part of a sequestered partition 404. The interrupt handler 407 may be used to detect an interrupt sent by the operating system running in the main partition 402 when a core dump is generated in the main partition 402. In one embodiment of the invention, an interprocessor bridge (IPB) library may be used to communicate between the two partitions.
After the core dump is generated, it is stored in a shared memory (operation 454). The shared memory is accessible by a second partition. An interrupt is then sent to notify the second partition that the core dump has been generated in the first partition (operation 456). In operation 458, the interrupt is detected by the second partition. Upon detection of the interrupt, the core dump is copied from the shared memory to a kernel buffer in the second partition (operation 460). In operation 462, the core dump is ready for analysis. In one embodiment of the invention, a user memory space application from the second partition may copy the core dump from the kernel buffer into a user memory space. In one embodiment of the invention, a memory based character driver may be used to extract the core dump from the shared memory and copy it to the user memory space.
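The sequence of operations 454 through 462 can be modeled in user space as follows. This is a sketch under stated assumptions rather than the disclosed implementation: two threads stand in for the two partitions, a `threading.Event` stands in for the inter-partition interrupt, and the names `SharedMemory`, `first_partition_dump`, and `second_partition_handler` are hypothetical.

```python
import threading


class SharedMemory:
    """Stand-in for the shared memory region visible to both partitions."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.length = 0  # number of valid core dump bytes in buf


def first_partition_dump(shared, interrupt, image: bytes):
    if len(image) > len(shared.buf):
        raise ValueError("core dump is larger than the shared memory")
    # Operation 454: store the generated core dump in the shared memory.
    shared.buf[: len(image)] = image
    shared.length = len(image)
    # Operation 456: send the interrupt to notify the second partition.
    interrupt.set()


def second_partition_handler(shared, interrupt, result: dict):
    # Operation 458: detect the interrupt (give up after 5 seconds).
    if not interrupt.wait(timeout=5.0):
        return
    # Operation 460: copy the core dump from shared memory to a kernel buffer.
    kernel_buffer = bytes(shared.buf[: shared.length])
    # Operation 462: the dump is ready; copy it to user space for analysis.
    result["user_copy"] = bytearray(kernel_buffer)


shared = SharedMemory(4096)
interrupt = threading.Event()
result = {}

handler = threading.Thread(
    target=second_partition_handler, args=(shared, interrupt, result)
)
handler.start()
first_partition_dump(shared, interrupt, b"\x7fCORE-DUMP-IMAGE")
handler.join()
print(len(result["user_copy"]), "bytes retrieved by the second partition")
```

Because the second partition copies the dump into its own buffer before analysis, later activity in the first partition cannot alter the retrieved image.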
It should be noted that a system failure may occur in the sequestered partition instead of the main partition as discussed in the previous examples. A person skilled in the art would appreciate that in the event a system failure occurs in the sequestered partition or in a partition other than the main partition, a core dump generated on the failed partition may be retrieved by the method described above. For example, if a core dump is generated on the sequestered partition due to a failure on this partition or an event triggered by a user, the core dump may be stored in the shared memory and made accessible to the main partition.
Computer system 600 further comprises a random access memory (RAM) or other dynamic storage device 604 (referred to as main memory) coupled to bus 611 for storing information and instructions to be executed by main processing unit 612. Main memory 604 also may be used for storing temporary variables or other intermediate information during execution of instructions by main processing unit 612.
Firmware 603 may be a combination of software and hardware, such as Erasable Programmable Read-Only Memory (EPROM) that has the operations for the routine recorded on the EPROM. The firmware 603 may embed foundation code, basic input/output system (BIOS) code, or other similar code. The firmware 603 may make it possible for the computer system 600 to boot itself.
Computer system 600 also comprises a read-only memory (ROM) and/or other static storage device 606 coupled to bus 611 for storing static information and instructions for main processing unit 612. The static storage device 606 may store OS level and application level software.
Computer system 600 may further be coupled to or have an integral display device 621, such as a cathode ray tube (CRT) or liquid crystal display (LCD), coupled to bus 611 for displaying information to a computer user. A chipset may interface with the display device 621.
An alphanumeric input device (keyboard) 622, including alphanumeric and other keys, may also be coupled to bus 611 for communicating information and command selections to main processing unit 612. An additional user input device is cursor control device 623, such as a mouse, trackball, trackpad, stylus, or cursor direction keys, coupled to bus 611 for communicating direction information and command selections to main processing unit 612, and for controlling cursor movement on a display device 621. A chipset may interface with the input/output devices. Similarly, devices capable of making a hardcopy 624 of a file, such as a printer, scanner, copy machine, etc., may also interact with the input/output chipset and bus 611.
Another device that may be coupled to bus 611 is a power supply such as a battery and an alternating current adapter circuit. Furthermore, a sound recording and playback device, such as a speaker and/or microphone (not shown), may optionally be coupled to bus 611 for audio interfacing with computer system 600. Another device that may be coupled to bus 611 is a wireless communication module 625. The wireless communication module 625 may employ a Wireless Application Protocol to establish a wireless communication channel. The wireless communication module 625 may implement a wireless networking standard such as the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standard, IEEE std. 802.11-1999, published by IEEE in 1999.
In one embodiment, the software used to facilitate the above routines or fabricate the above components can be embedded onto a machine-readable medium. A machine-readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant, manufacturing tool, any device with a set of one or more processors, etc.). For example, a machine-readable medium includes recordable/non-recordable media (e.g., read only memory (ROM) including firmware; random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; etc.), as well as electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); etc.
Although the invention has been described in detail hereinabove, it should be appreciated that many variations and/or modifications and/or alternative embodiments of the basic inventive concepts taught herein that may appear to those skilled in the pertinent art will still fall within the spirit and scope of the present invention as defined in the appended claims.