System availability, scalability, and data integrity are fundamental characteristics of enterprise systems. Nonstop performance is demanded in financial, communication, and other fields that use enterprise systems for applications such as stock exchange transaction handling, credit and debit card systems, telephone networks, and the like. Highly reliable systems are often deployed in applications with high financial or human costs, in circumstances of massive scaling, and in conditions in which outages and data corruption cannot be tolerated.
In a continuously available system, a user expects end-to-end application availability, a capability to achieve a desired operation within an acceptable response time. A scalable system is expected to achieve nearly 100% linear scalability, scaling from a few processors to thousands such that the incremental useful work gained when the nth processor is added to a cluster is essentially the same as that gained when the second or third processor is added.
One aspect of a highly available and reliable system is a capability to analyze failure events, enabling treatment and possible prevention of such failure conditions. A useful tool for diagnosing system difficulties in a computing system is a memory dump, an output file generated by an operating system during a failure for usage in determining the cause of the failure.
Unfortunately, capture, handling, and storage of diagnostic information can all impose a burden on the performance and availability of a system. For a sophisticated, complex, or large system, acquisition and handling of diagnostic information can last for many minutes or possibly hours, thereby compromising system availability.
Various techniques are used to reduce the time expended in writing debugging information and to limit the storage space for a crash dump file in the event of a fatal error. For example, the stored debug information may be reduced to cover only the operating system or kernel-level memory, allowing analysis of nearly all kernel-level system errors. Unfortunately, the kernel-level system dump remains large enough to compromise availability. An even smaller memory dump may be acquired to cover only the smallest amount of base-level debugging information, typically sufficient only to identify a problem.
In accordance with an embodiment of a computing system, a plurality of redundant, loosely-coupled processor elements are operational as a logical processor. A logic detects a halt condition of the logical processor and, in response to the halt condition, reintegrates and commences operation in less than all of the processor elements leaving at least one processor element nonoperational. The logic also buffers data from the nonoperational processor element in the reloaded operational processor elements and writes the buffered data to storage for analysis.
Embodiments of the invention, relating to both structure and method of operation, may best be understood by referring to the following description and accompanying drawings.
Various techniques may be used to capture a memory dump of a computing system. In one example, a computing system composed of multiple redundant processors can capture a memory dump by having a central processing unit, for example running an operating system that supports multiple redundant processors such as the NonStop Kernel™ made available by Hewlett Packard Company of Palo Alto, Calif., copy the memory of a non-executing central processing unit. In a particular example, the non-executing or “down” processor can run a Halted State Services (HSS) operating system. The executing processor copies the memory dump data from the non-executing processor back to the executing processor's memory and subsequently writes the data to a storage device, such as a disk file.
In a specific example, a pre-Fast-Memory-Dump (pre-FMD) technique can be used in which the running processor copies raw data from the down processor memory and compresses the raw data in the running processor. In an alternative post-FMD method, the raw data is compressed in the down processor, for example under HSS, and the compressed data is moved via network communications, such as ServerNet™, to the running processor, which writes the compressed data to storage, such as a memory dump disk. A further Fast-Memory-Dump enhancement involves copying only part of the memory, either compressed or noncompressed, from the down processor to the running processor, reloading the operating environment into that memory part to begin execution, and then copying the remainder of the memory after completion of the reload, returning the memory to the running system.
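To make the compression step concrete, the following is a minimal C sketch of one way a dump page could be compressed before being written to the dump disk. The zero-run-length scheme, the buffer handling, and the function name are illustrative assumptions only and do not represent the actual pre-FMD or post-FMD dump format.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical illustration: compress a page of dump memory by
     * collapsing runs of zero bytes, which often dominate memory dumps.
     * Assumed output format: 0x00 followed by a run length for zero runs,
     * otherwise the literal (non-zero) byte.  Returns bytes written,
     * or 0 if the output buffer is too small. */
    static size_t compress_dump_page(const uint8_t *src, size_t len,
                                     uint8_t *dst, size_t dst_cap)
    {
        size_t in = 0, out = 0;
        while (in < len) {
            if (src[in] == 0) {
                size_t run = 0;
                while (in < len && src[in] == 0 && run < 255) {
                    run++;
                    in++;
                }
                if (out + 2 > dst_cap)
                    return 0;                /* output buffer too small */
                dst[out++] = 0x00;           /* zero-run marker */
                dst[out++] = (uint8_t)run;   /* run length, 1..255 */
            } else {
                if (out + 1 > dst_cap)
                    return 0;
                dst[out++] = src[in++];      /* literal non-zero byte */
            }
        }
        return out;
    }

In the pre-FMD variant such a routine would run in the executing processor after copying raw pages; in the post-FMD variant an equivalent routine would run in the down processor under HSS before the network transfer.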
The described dump techniques capture memory contents prior to reloading the processor, for example by copying the memory to the system swap files or special dump files. The copying time may be mitigated, for example, by copying part of memory and reloading into only that copied part, then recopying the remainder of the data after the processor is reloaded, returning memory to normal usage once the copy is complete. The method reduces the time expenditure, and thus system down-time, for capturing and storing the memory dump data, but significantly impacts processor performance since only a subset of the memory is available for normal operation. The techniques also involve a significant delay before normal operation can resume.
In an illustrative example embodiment, a logical processor may include two or three processor elements running the same logical instruction stream. A dual-modular redundant (DMR) logical processor and/or a tri-modular redundant (TMR) logical processor can capture and save a logical processor memory dump while concurrently running an operating system on the same logical processor. In a fully parallel technique that begins following a computing system halt condition, fewer than all of the processor elements are reloaded and made operational, leaving at least one processor element in a non-running or “down” condition, not integrated into the logical processor. Memory dump data is copied from the down processor element, for example using a dissimilar data exchange direct memory access (DMA) transfer.
The loosely-coupled processors 102 form a combined system of multiple logical processors, with the individual processors 102 assuring data integrity of a computation. The illustrative computing system 100 operates as a node that can be connected to a network 104. One example of a suitable network 104 is a dual-fabric Hewlett Packard ServerNet cluster™. The computing system 100 is typically configured for fault isolation, small fault domains, and a massively parallel architecture. The computing system 100 is a triplexed server which maintains no single point of hardware failure, even under processor failure conditions. The illustrative configuration has three instances of a processor slice 106A, 106B, 106C connected to four logical synchronization units 108. A logical synchronization unit 108 is shown connected to two network fabrics 104.
Individual logical computations in a logical processor are executed separately three times in the three physical processors. The individual copies of computation results eventually produce an output message for input/output or interprocess communication, which is forwarded to a logical synchronization unit 108 and mutually checked for agreement. If any one of the three copies of the output message differs from the others, the differing copy of the computation is “voted out” of future computations, and computations on the remaining instances of the logical processor continue. Accordingly, even after a processor failure, no single point of failure is carried forward into further computations. At a convenient time, the errant processing element can be replaced online, reintegrating with the remaining processing elements and restoring the computing system 100 to fully triplexed computation.
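The voting of output messages can be illustrated with a short C sketch. The function name, message representation, and return convention below are assumptions for illustration; the actual comparison is performed in the logical synchronization unit hardware.

    #include <string.h>

    /* Illustrative three-way vote over the output messages produced by
     * processor slices A, B, and C.  Returns the index (0-2) of a slice
     * whose message disagrees with the other two and should be voted out,
     * -1 if all copies agree, or -2 if no majority exists.  The message
     * layout and fixed length are assumptions. */
    static int vote_output(const void *msg[3], size_t len)
    {
        int ab = (memcmp(msg[0], msg[1], len) == 0);
        int ac = (memcmp(msg[0], msg[2], len) == 0);
        int bc = (memcmp(msg[1], msg[2], len) == 0);

        if (ab && ac)
            return -1;        /* full agreement: forward the message */
        if (ab)
            return 2;         /* slice C diverges */
        if (ac)
            return 1;         /* slice B diverges */
        if (bc)
            return 0;         /* slice A diverges */
        return -2;            /* no majority: treat as a logical processor halt */
    }

A full-agreement result forwards the message, a two-to-one result votes out the divergent slice, and the no-majority case corresponds to a halt of the logical processor.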
In the illustrative configuration, the individual processor slices 106A, 106B, 106C are each associated respectively with a memory 110A, 110B, 110C and a reintegration element 112A, 112B, 112C. The data can be buffered or temporarily stored, typically in the memory associated with the running processor slices. The individual processor slices 106A, 106B, 106C also can include a logic, for example a program capable of execution on the processor elements 102 or other type of control logic, that reloads and reintegrates the nonoperational processor element or slice into the logical processor after the data is buffered.
An asymmetric data dump is desirable to enable analysis of the causes of a failure condition. Diagnostic information is more likely to be collected if acquisition can be made without compromising performance and availability.
The computing system 100 can activate capture of diagnostic memory dump information upon a logical processor halt condition. Diagnostic memory dump collection logic reloads the logical processor but does not include all processor slices 106A, 106B, 106C in the reload. In some implementations and in some conditions, the processor slice omitted from the reload is selected arbitrarily. In other implementations or conditions, the omitted processor slice can be selected based on particular criteria, such as measured performance of the individual slices, variation in capability and/or functionality of the particular processor slices, or the like. The processor slice omitted from the reload is maintained in the stopped condition with network traffic neither allowed into the stopped processor elements 102 nor allowed as output.
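One possible selection policy is sketched below in C. The structure fields and the preference for a slice implicated in the failure are assumptions for illustration only; any policy that leaves the remaining slices able to reload would serve.

    /* Hypothetical policy for choosing which processor slice to omit from
     * the reload.  The slice_info fields are illustrative assumptions. */
    struct slice_info {
        int present;          /* slice is physically installed          */
        int suspected_faulty; /* slice implicated in the halt condition */
    };

    static int choose_omitted_slice(const struct slice_info s[], int nslices)
    {
        /* Prefer a slice already implicated in the failure... */
        for (int i = 0; i < nslices; i++)
            if (s[i].present && s[i].suspected_faulty)
                return i;
        /* ...otherwise pick arbitrarily: the first installed slice. */
        for (int i = 0; i < nslices; i++)
            if (s[i].present)
                return i;
        return -1;            /* no slice available to omit */
    }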
Regardless of the reason the processor element is halted, the element may remain stopped and the associated memory may be dumped prior to reintegrating the element.
After completing the partial reload, a parallel receive dump (PRD) program is automatically initiated 206. A separate instruction stream of the PRD program typically executes in all reloaded processor slices, although other implementations may execute from logic located in alternative locations.
The parallel receive dump (PRD) program creates 208 a dump file and allocates 210 buffers, typically in memory associated with the processor slices. All architectural state of the processor elements is typically saved in the memory so that separate collection of the information is superfluous.
The parallel receive dump (PRD) program opens 212 a memory window on a physical partition of memory associated with the particular processor slice executing the PRD program. The parallel receive dump (PRD) program moves 214 the window over the entire partition. In the illustrative embodiment, all processor elements of the logical processor generally have identical partitions so that the window also describes the partition of the omitted processor element.
The parallel receive dump (PRD) program performs a divergent data direct memory access (DMA) 216 operation that transfers data from one processor slice to another via DMA. A specific embodiment can use high-order address bits of the source memory address to identify the particular omitted processor slice and/or processor element. For example, four memory views are available using ServerNet™, including the logical processor as a whole, processor slice A, processor slice B, and processor slice C. A direct memory access device reads data from the omitted processor slice and copies the data to a buffer in memory of the two executing processor slices, for example running the NonStop Kernel™.
In a particular implementation of the DMA operation, the source physical address high-order bits denote the one specific processor slice omitted from the reload. The target buffer for the DMA operation is in memory associated with the reloaded processor slices running the operating system and not the memory associated with the stopped processor slice.
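The addressing scheme can be pictured with a small C sketch. The bit position chosen for the slice-select field and the view codes are assumptions for illustration; the actual encoding of ServerNet memory views is implementation-specific.

    #include <stdint.h>

    /* Illustrative encoding of a divergent-data DMA source address in
     * which high-order bits select the memory view.  The shift amount
     * and view codes are assumed, not the real hardware encoding. */
    enum mem_view {
        VIEW_LOGICAL = 0,   /* logical processor as a whole */
        VIEW_SLICE_A = 1,
        VIEW_SLICE_B = 2,
        VIEW_SLICE_C = 3
    };

    #define VIEW_SHIFT 62u  /* assumed: two high-order bits select the view */

    static uint64_t dma_source_address(enum mem_view view, uint64_t phys_addr)
    {
        /* Mask off any stray high-order bits, then tag the address with
         * the view naming the omitted (down) processor slice. */
        uint64_t base = phys_addr & ((1ull << VIEW_SHIFT) - 1);
        return base | ((uint64_t)view << VIEW_SHIFT);
    }

The target buffer address carries no such tag because it simply refers to memory of the reloaded slices running the operating system.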
The computing system that executes the asymmetric memory dump operation 200 may further include a reintegration logic that restarts and resynchronizes the plurality of processor elements following a failure or service condition. A reintegration process is executable on at least one of the operating processor elements and delays reintegration of the nonoperational processor element until dump processing is complete.
When the input/output operations for transferring the memory dump to buffers in the executing processor slices are complete, the parallel receive dump program compresses 218 the data into a compressed dump format and writes 220 the compressed data to a storage device, for example a dump file on an external disk storage. Transfer of the compressed dump data is similar to the receive dump operation of the pre-Fast-Memory-Dump (pre-FMD) technique.
When the parallel receive dump program has completed copying, compressing, and writing the dump data to storage, the parallel receive dump program closes 222 the window to physical memory, closes 224 the storage file, and initiates 226 reintegration of the dumped processor slice.
The omitted processor slice is reintegrated 228 and the dump operation completes.
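The complete flow of steps 206 through 228 can be summarized with a self-contained C simulation. Slice memories are modeled as heap buffers, the divergent-data DMA as a memory copy, and the dump storage as a local file; all sizes and the file name are assumptions, and the compression step 218 is omitted for brevity.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PARTITION_BYTES (1u << 20)     /* assumed 1 MiB partition  */
    #define WINDOW_BYTES    (64u * 1024u)  /* assumed 64 KiB window    */

    int main(void)
    {
        unsigned char *down_memory = malloc(PARTITION_BYTES);   /* omitted slice  */
        unsigned char *buffer      = malloc(WINDOW_BYTES);      /* running slices */
        FILE *dump_file = fopen("logical_processor.dmp", "wb"); /* step 208       */

        if (!down_memory || !buffer || !dump_file)
            return 1;
        memset(down_memory, 0xAB, PARTITION_BYTES);  /* stand-in dump content */

        /* Steps 212-220: move a window over the entire partition, copy the
         * omitted slice's memory into buffers of the running slices (the
         * divergent-data DMA), then write each piece to the dump file. */
        for (size_t off = 0; off < PARTITION_BYTES; off += WINDOW_BYTES) {
            memcpy(buffer, down_memory + off, WINDOW_BYTES);    /* "DMA" transfer */
            fwrite(buffer, 1, WINDOW_BYTES, dump_file);         /* step 220       */
        }

        fclose(dump_file);                 /* step 224: close the storage file */
        free(buffer);
        free(down_memory);
        /* Steps 226-228: reintegration of the omitted slice would begin here. */
        return 0;
    }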
The diagnostic memory dump operation 300 is invoked as a result of the halt condition. The condition causes one of the logical processors to halt while the other logical processors continue to run and do not require reloading. Data from the halted logical processor is dumped to one of the running logical processors, and the dump data is written to storage, such as a disk. The halted processor is then reloaded. Only one logical processor is halted and only one reload takes place. In a reload of a logical processor, a new copy of the operating system is brought up in the logical processor. In reintegration of a processor element, the processor element is merged back into a running logical processor.
The diagnostic memory dump method 300 can be implemented in an executable logic such as a computer program or other operational code that executes in the processor elements or in other control elements. The operation begins on detection 302 of a halt condition of at least one processor element of multiple redundant processor elements. In a particular example, a system may implement a pointer to a linked list of various fault and operating conditions that causes execution to begin at a servicing logic for a particular condition. Various conditions may evoke a response that generates a diagnostic memory dump. One processor element, termed a “down” processor element, is maintained 304 in the state existing at the halt condition, and the other processor elements are reloaded 306 and enabled to commence execution, for example restarting an operating system such as Hewlett Packard's NonStop Kernel™. One technique for initiating a response to the halt condition issues a command to place the logical processor in a “ready for reload” state, in which the command designates the processor element to be omitted from the reload. The command causes “voting-out” of the omitted processor element and reloads the remaining processor elements for execution.
The state of the down processor maintained at the halt condition is copied 308 to a storage while the reloaded other processors continue executing. For example, once reloading of the processor elements other than the down processor element is complete, the reloaded processor elements can automatically initiate a parallel receive dump program that creates a dump file, allocates buffers in memory associated with the reloaded processors for usage in temporarily storing the memory dump data, and saves the architectural state of all processor elements in memory buffers. In some embodiments, a special direct memory access (DMA) operation can be started that copies memory of the down processor element to buffers in the running processor elements. The divergent data Direct Memory Access (DMA) operation uses a self-directed write whereby a designated source memory address identifies the one processor element maintained in the halt condition.
On completion of the input/output operation of the divergent DMA operation, the parallel receive dump program compresses the data into dump format, writes 312 the compressed memory dump data to a storage device, and closes the memory window. The divergent data DMA operation can be used to write 312 the memory dump from the buffers to a storage device, for example an external disk storage device, for subsequent analysis.
After the data dump is copied from the down processor to the executing processors 308, and possibly concurrently with transfer of the memory dump from the down processor to storage, such as the dump file and/or temporary buffer, the down processor element is reintegrated 310 into the logical processor.
The illustrative technique enables the logical processor to return to service immediately following a halt condition, eliminating or avoiding the delay otherwise involved in copying the diagnostic dump memory to disk. The illustrative technique further preserves operation as a highly available system, eliminating or avoiding removal of a logical processor from service for collection of diagnostic information, which may take a substantial amount of time. Commonly, transferring the data to disk can take several minutes, a long time in a highly available system. The illustrative technique reduces the latency effectively to zero.
In the event of a halt condition, a logical processor halts. The computer system 400 as a whole is made up of multiple logical processors. The halted logical processor ceases functioning under the normal operating system, for example the NonStop Kernel system, and enters a state that allows only abbreviated functionality. In one example, the halt condition causes the system to function under halted state services (HSS). A failure monitoring logic, such as software or firmware operating in control logic, detects the halt condition and selects from among the processor slices 402 a processor slice to omit from reloading. In various implementations the omitted processor slice may be selected arbitrarily or based on functionality or operating characteristics, such as performance capability considerations associated with the different processor slices 402. The control logic reloads the operating system into memory for the processor slices that are not selected for omission so that those processor slices return to execution. The omitted processor slice remains inactive or “down”, continuing operation under halted state services (HSS).
The reloaded processor slices request capture of a memory dump from the omitted processor memory while the omitted processor slice remains functionally isolated from the operating processor slices. The reloaded processors begin a copy process, for example one that executes on the reloaded processors and stores the memory dump data from the omitted processor in memory 406 associated with the operating processor slices 402. The diagnostic dump data passes from the omitted processor slice to the reloaded and operating processor slices via a pathway through a logical synchronization unit 414. The logical synchronization unit 414 is an input/output interface and synchronization unit that can be operated to extract the memory dump data from the omitted processor slice, including a copy of the data and a copy of a control descriptor associated with the data. The reintegration logic 410 generally operates in conditions of processor slice failure to reintegrate the operations of the failed slice into the group of redundant slices, for example by replicating write operations of memory for one processor slice to memory of other processor slices in a redundant combination of processor slices.
In a specific example, software executing in one or more of the reloaded and running processor slices performs asymmetric input/output operations that copy memory dump data from the omitted processor slice to memory buffers in both of the operating, reloaded processor slices. Accordingly, the two reloaded processor slices 402 operate temporarily in duplex mode while acquiring and storing the diagnostic dump information before returning to triplex operation.
In a particular embodiment, the microprocessors 404 may be standard Intel Itanium Processor Family multiprocessors that share a partitioned memory system. Each microprocessor may have one or more cores per die. A processor slice 402 with an N-Way Symmetrical Multi-Processor (SMP) supports N logical processors. Each logical processor has an individual system image and does not share memory with any other processor.
Reintegration logic 410 can replicate memory write operations to the local memory and send the operations across a reintegration link 412 to another slice. The reintegration logic 410 is configurable to accept memory write operations from the reintegration link 412. The reintegration logic 410 can be interfaced between the I/O bridge/memory controller 408 and memory 406, for example Dual In-line Memory Modules (DIMMs). Alternatively, the reintegration logic 410 may be integrated into the I/O bridge/memory controller 408. Reintegration logic 410 is used to bring a new processor slice 402 online by bringing its memory state in line with the other processor slices.
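A minimal software model of the write reflection performed by the reintegration logic 410 is sketched below in C, with local and target memories represented as arrays and the reintegration link as a direct copy; the names, sizes, and flag are illustrative assumptions.

    #include <stddef.h>
    #include <stdint.h>

    #define MEM_WORDS 1024

    static uint64_t local_memory[MEM_WORDS];   /* source slice memory        */
    static uint64_t target_memory[MEM_WORDS];  /* memory across the link     */
    static int reintegration_active;           /* reflection enabled flag    */

    /* Every memory write on the source passes through this hook. */
    static void reflected_write(size_t index, uint64_t value)
    {
        local_memory[index] = value;           /* normal local memory write  */
        if (reintegration_active)
            target_memory[index] = value;      /* reflected across the link  */
    }

During normal operation the flag is clear and writes touch only local memory; during reintegration every write is mirrored so the target memory converges on the source state.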
In an illustrative example, the computer system 400 uses loosely lock-stepped multiprocessor boxes called slices 402, each a fully functional computer with a combination of microprocessors 404, cache, memory 406, and interfacing 408 to input/output lines. All output paths from the multiprocessor slices 402 are compared for data integrity. A failure in one slice 402 is handled transparently, with the other slices 402 continuing operation. The computer system 400 executes in a “loose lock-stepping” manner in which redundant microprocessors 404 run the same instruction stream and compare results intermittently, not on a cycle-by-cycle basis but rather when the processor slice 402 performs an output operation. Loose-lockstep operation prevents error recovery routines and minor non-determinism conditions in the microprocessor 404 from causing lock-step comparison errors.
Logical gateway 514 has two independent voter subunits, one for voting Programmed Input/Output (PIO) read and write transactions 516 and a second for voting Direct Memory Access (DMA) read responses 518. The Direct Memory Access (DMA) read response subunit 518 verifies input/output controller-initiated DMA operations or responses from memory and performs checks on read data with processors performing voted-write operations. DMA write traffic is originated by the input/output controller 522 and is replicated to all participating processor slices 504. DMA read traffic is originated by the input/output controller 522. DMA read requests are replicated to all participating slices 504.
During operation, the processor slices A 902, B 904, and C 906 generally are configured as multiple tri-modular logical processors that execute in loose lockstep with I/O outputs compared by the voter units 908 before data is written to the System Area Network 914.
The term processor complex 900 is merely a descriptive term and does not necessarily define a system enclosed within a single housing. A processor complex is generally not a single field-replaceable unit. In a typical minimum configuration, a field-replaceable unit can include one LSU and one slice.
The voter units 908 are logical gateways for operations and data crossing from the unchecked domain of the logical synchronization blocks 912 to a self-checked domain. DMA read data and PIO read and write requests that address the self-checked domain are checked by the voter unit 908 in order of receipt. Operations are not allowed to pass one another; each completes before the next is allowed to start. DMA read response data are also checked in the order received and then forwarded to the system area network interface 910, for example a Peripheral Component Interconnect Extended (PCI-X) interface. PIO requests and DMA read responses are processed in parallel with no ordering forced or checked between the two streams.
The microprocessor used in the processor slice 1004 may have multiple cores, for example a microprocessor die with multiple processor cores, each running an independent instruction stream. The term processor element refers to a single core.
The logical processor 1006 is formed from one or more processor elements 1002, for example three in the illustrative embodiment, depending on the number of processor slices 1004 available. A simplex logical processor has only one processor element (PE) per logical processor. A dual-modular redundant (DMR) logical processor has two processor elements (PEs) per logical processor. A tri-modular redundant (TMR) logical processor has three. Each processor element 1002 in a logical processor 1006 runs the same instruction stream, in loosely lock-stepped operation, and output data from multiple processor elements is compared during data input/output (I/O) operations.
In the illustrative embodiment, one or two logical synchronization units 500 are combined with one, two, or three processor elements 1002 to create varying degrees of fault tolerance in the logical processors 1006. A system may optionally be configured with a second logical synchronization unit 500 per logical processor 1006.
The logical synchronization unit 500 may use fully self-checked logic. The logical synchronization unit 500 resides between the replicated processor slices 902, 904, 906 and the system area network 914 and, in some implementations, may not have its data compared in the manner used for the processor slices to preserve data integrity. Accordingly, appropriate redundancy techniques may be used in the logical synchronization unit 500 to ensure data integrity and fault isolation.
The voter logic 908 connects the processor slices 902, 904, 906 to the SAN interface 910 and supplies synchronization functionality for the logical processor. More specifically, the voter logic 908 compares data from programmed input/output (PIO) reads and writes to registers in the logical synchronization unit 912 from each of the processor elements. The comparison is called voting and ensures that only correct commands are sent to logical synchronization unit logic. Voter logic 908 also reads outbound data from processor slice memories and compares the results before sending the data to the system area network (SAN), ensuring that outbound SAN traffic only contains data computed, or agreed-upon by voting, by all processor elements in the logical processor. The voter logic 908 also replicates and distributes programmed input/output (PIO) data read from the system area network and registers in the logical synchronization unit 912 to each of the processor elements. The voter logic 908 further replicates and distributes inbound data from the system area network to each of the processor elements. The voter logic 908 can supply time-of-day support to enable processor slices to simultaneously read the same time-of-day value. The voter logic 908 supports a rendezvous operation so that all processor slices can periodically check for mutual synchrony, and cause one or more processor elements to wait to attain synchrony. The voter logic 908 also supports asymmetric data exchange buffers and inter-slice interrupt capability.
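The rendezvous operation can be illustrated with a C sketch that models each processor element as a POSIX thread and the voter-supported rendezvous as a barrier. This is purely an analogy under those assumptions; the actual mechanism is implemented in the logical synchronization unit rather than in threading primitives.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_ELEMENTS 3   /* TMR logical processor: one thread per element */

    static pthread_barrier_t rendezvous;

    static void *processor_element(void *arg)
    {
        long id = (long)arg;
        for (int step = 0; step < 4; step++) {
            /* ...some amount of independent, loosely lock-stepped work... */
            printf("element %ld reached rendezvous %d\n", id, step);
            pthread_barrier_wait(&rendezvous);  /* wait for mutual synchrony */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NUM_ELEMENTS];
        pthread_barrier_init(&rendezvous, NULL, NUM_ELEMENTS);
        for (long i = 0; i < NUM_ELEMENTS; i++)
            pthread_create(&t[i], NULL, processor_element, (void *)i);
        for (int i = 0; i < NUM_ELEMENTS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&rendezvous);
        return 0;
    }

An element that arrives at the rendezvous early simply waits until the slower elements catch up, restoring loose synchrony across the slices.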
The voter logic is shown as the logical gateway 514 in the illustrative embodiment.
The logical gateway 514 forwards data to the processor element memories 506 at approximately the same time. However, the processor elements 504 do not execute in perfect lockstep so that data may arrive in memory 506 early or late relative to program execution of the particular processor element 504. No data divergence results among the processor elements since the SAN programming model does not allow access to an inbound buffer until after arrival notification receipt.
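The reason early or late arrival causes no divergence can be shown with a small C sketch in which a delivery thread stands in for the logical gateway and an atomic flag stands in for the arrival notification; the names and the spin-wait are illustrative assumptions.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <string.h>

    static char inbound_buffer[64];
    static atomic_int arrival_notified;

    /* Models the logical gateway delivering inbound data to memory. */
    static void *san_delivery(void *arg)
    {
        (void)arg;
        strcpy(inbound_buffer, "inbound SAN data");  /* DMA write to memory   */
        atomic_store(&arrival_notified, 1);          /* then the notification */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, san_delivery, NULL);

        /* The processor element may run ahead of or behind the delivery,
         * but it only reads the buffer after seeing the notification. */
        while (!atomic_load(&arrival_notified))
            ;                                        /* wait for arrival */
        printf("%s\n", inbound_buffer);

        pthread_join(t, NULL);
        return 0;
    }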
The system area network (SAN) interface 522 is generally used for all input/output, storage, and interprocessor communications. The SAN interface 522 communicates with the three processor slices through the logical gateway 514. System area network (SAN) traffic passes to and from a logical processor and not individual processor elements. The logical gateway 514 replicates data from the system area network to the memory of all processor elements 504 participating in the logical processor. The logical gateway 514 also performs the voting operation, comparing data from the slices before passing the data to the SAN interface.
In an illustrative configuration, each logical processor has a dedicated SAN interface. Redundant execution paths can be implemented to prevent a single failure or bug from disabling multiple logical processors.
The system performs according to a loose lock-step fault tolerance model that enables high availability. The system tolerates hardware faults and many software faults via loosely-coupled clustering software that can shift workload from a failing processor to the other processors in the cluster. The model tolerates single hardware faults as well as software faults that affect only a single processor. The model relies on processor self-checking and on stopping immediately before faulty data is written to persistent storage or propagated to other processors.
After a failure or a service operation on a multiple-slice system, the new processor or processors are reintegrated, including restarting and resynchronizing with the existing running processors. The process of restoring the processor memory state and returning to loose lock-step operation with the existing processor or processors is called reintegration. Unlike a logical processor failure, failures of the multiple-redundant hardware and the subsequent reintegration of the replacement or restarted hardware are not detectable by application software. Reintegration is an application-transparent action that incorporates an additional processor slice into one or more redundant slices in an operating logical processor.
In the reintegration action, all memory and processor state of a running slice is copied to the second slice, and both slices then continue operation. The basic approach is complicated by the requirement that reintegration have minimal duration and minimal performance and availability impact on the running processor, for example the reintegration source, which continues to execute application programs.
In some embodiments, special hardware called a reintegration link is used to copy memory state from the reintegration source to the target. Since the processor on the reintegration source continues executing application code and updating memory, the reintegration link allows normal operations to modify memory and still have modifications reflected to the reintegration target memory.
Reintegration link hardware can be implemented so that reintegration occurs in groups of one or more logical processors. For example, an implementation can reintegrate an entire slice even if only one processor element of one logical processor is restarted. A reintegration scheme that affects only a single logical processor reduces or minimizes the amount of time a system runs on less than full capability.
Reintegration is triggered by a condition such as a processor slice replacement, receipt of a command from a system management facility, and/or occurrence of an input/output voting error or other error detected by the other processor elements in the logical processor. For processor slice replacement, each new processor element is reintegrated by the currently running logical processors. For a detected error, the remaining executing processor elements can reset the faulty processor element and reintegrate it. For a transient error, the processor element can be brought back to a fully functional state.
The logical processor that can reintegrate a new processor element determines whether to begin the reintegration process. For reintegration of an entire slice, control resides in the processor complex. In both cases reintegration control is below the system-level function. If reintegration fails, the logical processor simply logs the error and continues to attempt the reintegration process. The logical processor may reduce the frequency of reintegration attempts, but continues to try until successful.
During normal operation the reintegration logic 410 transparently passes memory operations between the microprocessors 404 and local memory 406. Reintegration link 412 usage is generally limited to scrubbing of latent faults.
During the reintegration operation, reintegration logic 410 at the source processor slice duplicates all main memory write operations, sending the operation both to the local memory 406 and across the reintegration link 412. At the target processor slice, reintegration logic 410 accepts incoming writes from the reintegration link 412 and writes target local memory 406. During reintegration the target does not execute application programs but rather executes a tight cache resident loop with no reads or writes to target local memory 406.
The reintegration link 412 is a one-way connection from one processor slice to one adjacent neighbor processor slice. For a system that includes three processor slices A, B, and C, only slice A can reintegrate slice B, only slice B can reintegrate slice C, and only slice C can reintegrate slice A. Starting from one processor slice with two new processor slices, reintegration can be done in two steps. The first reintegration cycle brings the second processor slice online and the second reintegration cycle brings the third online.
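The ring relationship among the three slices can be captured in a one-line C helper; the slice numbering is an assumption for illustration.

    /* The one-way reintegration links form a ring: A reintegrates B,
     * B reintegrates C, and C reintegrates A.  Slices are numbered
     * 0 = A, 1 = B, 2 = C in this sketch. */
    #define NUM_SLICES 3

    static int reintegration_source(int target_slice)
    {
        /* The source is the target's upstream neighbor in the ring. */
        return (target_slice + NUM_SLICES - 1) % NUM_SLICES;
    }

For example, reintegration_source(1) returns 0, reflecting that only slice A can reintegrate slice B.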
For reintegration of a single processor slice, the reintegration logic 410 on the source and target slices is initialized. The logical synchronization unit 414 is set unresponsive to the target: no interrupts are delivered to the target, and input/output operations do not include the target. The target is set to accept writes from the reintegration link 412 and executes an in-cache loop waiting for reintegration to complete. While maintaining local processing, the source reads and writes back all memory local to the source, in an atomic operation since the SAN interface may simultaneously be updating source memory. A single-pass operation reads each cache block from memory and then tags the cache block as dirty with an atomic operation, without changing contents. Target memory is then updated except for the state contained in the remaining dirty cache blocks in the source cache. All processes are suspended, and the architectural state, for example the processor registers, is stored. Caches are flushed and can be validated. Cache validation is optional but facilitates synchronization of the reintegration source and target after completion of reintegration; without validation, the source would consistently have cache hits while the target would miss. All writes are reflected to the target processor, reintegrating the final bytes of dirty cache data into the target processor slice. The logical synchronization unit 414 is then enabled to read and write to the target, write reflecting is stopped, both slices spin in a cache loop, and the rendezvous operation is performed. After the rendezvous operation completes, the target should have exactly the same state as the source. Architectural state is restored from memory 406, and operation resumes.
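The memory sweep can be approximated by a simplified two-pass copy with dirty tracking, sketched below in C. The software dirty bitmap and block size are assumptions standing in for the cache-state tagging described above; the sketch only mirrors the ordering of the passes.

    #include <stdint.h>
    #include <string.h>

    #define NUM_BLOCKS  256
    #define BLOCK_BYTES 64

    static uint8_t source_mem[NUM_BLOCKS][BLOCK_BYTES];
    static uint8_t target_mem[NUM_BLOCKS][BLOCK_BYTES];
    static uint8_t dirty[NUM_BLOCKS];

    /* All source writes go through this hook while reintegration runs. */
    static void source_write(int block, int offset, uint8_t value)
    {
        source_mem[block][offset] = value;
        dirty[block] = 1;                   /* block must be re-copied */
    }

    static void reintegrate_memory(void)
    {
        /* Pass 1: copy everything while the source continues executing;
         * clear the dirty mark just before each block is copied so that
         * any later write re-marks it. */
        for (int b = 0; b < NUM_BLOCKS; b++) {
            dirty[b] = 0;
            memcpy(target_mem[b], source_mem[b], BLOCK_BYTES);
        }
        /* Pass 2: with source execution suspended, copy only the blocks
         * written after their first-pass copy; both slices can then resume
         * from identical memory state. */
        for (int b = 0; b < NUM_BLOCKS; b++)
            if (dirty[b])
                memcpy(target_mem[b], source_mem[b], BLOCK_BYTES);
    }

The short second pass is what keeps the pause of the source slice, and thus the availability impact, small.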
In the illustrative implementation, reintegration affects all processor elements in a slice. Alternatively stated, even if a failure in a triple-redundant processor system affects only one processor element of one logical processor, reintegration is performed to the entire memory of the affected processor slice. In the triple-redundant system, the processor complex executes applications with two processor slices active during reintegration so that no loss of data integrity occurs.
In the illustrative implementation, reintegration is performed on an entire processor slice. In other implementations, a single processor element of the processor slice may be reintegrated.
Reintegration enables hardware to recover from a hardware fault without application acknowledgement of the failure. An alternative to reintegration may be to halt the logical processor and allow the application workload to shift to other logical processors. The halted processor can then be restarted, effectively reintegrating the failing processor element in a logical processor.
While the present disclosure describes various embodiments, these embodiments are to be understood as illustrative and do not limit the claim scope. Many variations, modifications, additions and improvements of the described embodiments are possible. For example, those having ordinary skill in the art will readily implement the steps necessary to provide the structures and methods disclosed herein, and will understand that the process parameters, materials, and dimensions are given by way of example only. The parameters, materials, components, and dimensions can be varied to achieve the desired structure as well as modifications, which are within the scope of the claims. Variations and modifications of the embodiments disclosed herein may also be made while remaining within the scope of the following claims. For example, the specific embodiments described herein identify various computing architectures, communication technologies and configurations, bus connections, and the like. The various embodiments described herein have multiple aspects and components. These aspects and components may be implemented individually or in combination in various embodiments and applications. Accordingly, each claim is to be considered individually and not to include aspects or limitations that are outside the wording of the claim.
Related application: U.S. Provisional Application No. 60557812, filed March 2004.