Computing with both lock-step and free-step processor modes

Abstract
A computer system provides for both lock-step and free-step processor modes, allowing for an effective tradeoff between performance and data integrity.
Description
BACKGROUND OF THE INVENTION

The present invention relates to computers and, more particularly, to error-handling in computers. In this specification, related art labeled “prior art” is admitted prior art; related art not labeled “prior art” is not admitted prior art.


Applications for which errors are unacceptable can be run on computers that detect and address errors that inevitably occur. For example, parity bits or error-correction code bits can be added to data being communicated so that errors can be detected and, in some cases, corrected. If the errors cannot be corrected, often the data can be regenerated or retransmitted. However, it is generally impractical to employ error detection and error correction extensively within a processor. Accordingly, data corruptions occurring within a processor are often undetected, i.e., “silent”.


Such silent data corruption can be addressed by running two or more processors in lock-step. In other words, the same program is run on two processors. The processor outputs can then be compared with differences being used to indicate errors, with various approaches being available for addressing the detected errors.
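The comparison described above can be illustrated in software. The following is only a minimal sketch, assuming that each "processor" is modeled as a Python callable; in practice the comparison is performed in hardware, and the names run_in_lockstep, LockstepMismatch, healthy, and corrupted are hypothetical.

```python
from typing import Callable, Sequence


class LockstepMismatch(Exception):
    """Raised when redundant executions of the same program disagree."""


def run_in_lockstep(executions: Sequence[Callable[[int], int]], value: int) -> int:
    """Run every execution on the same input and compare the outputs.

    Expects at least two executions; each stands in for one processor.
    """
    outputs = [execute(value) for execute in executions]
    if any(output != outputs[0] for output in outputs[1:]):
        # A difference indicates an error in at least one execution.
        raise LockstepMismatch(f"outputs differ: {outputs}")
    return outputs[0]


def healthy(x: int) -> int:
    """The program as it is meant to run."""
    return x * 2


def corrupted(x: int) -> int:
    """The same program with a simulated silent corruption (lowest bit flipped)."""
    return (x * 2) ^ 1
```

With two healthy executions the agreed result is returned; pairing a healthy execution with the corrupted one raises LockstepMismatch, which is the point at which an error-handling approach would be invoked.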


SUMMARY OF THE INVENTION

As defined in the claims, the present invention provides for both lock-step and free-step (normal, non-lockstep) operation in the same computer system. Various embodiments of the invention include systems in which the step mode for each processor is fixed, in which step modes can be configured (e.g., upon boot-up), and systems in which the step mode is dynamically reallocable. In each case, the invention provides for more optimal trade-offs between error detection and performance, as operating processors in free-step mode generally provides greater performance at the risk of greater vulnerability to errors. These and other features and advantages of the invention are apparent from the description below with reference to the following drawings.




BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a schematic illustration of one of many possible computer systems in accordance with the present invention.



FIG. 2 is a flow chart of one of many possible methods in accordance with the present invention.




DETAILED DESCRIPTION

A computer system AP1 shown in FIG. 1 includes four processors P11, P12, P21, P22. Processors P11 and P12 interface to the rest of system AP1 via a core electronics component (CEC) 11, while processors P21 and P22 interface to the rest of the system via a different core electronics component 12. In effect, core electronics components 11 and 12 define two processor sets, one including processors P11 and P12, and the other including processors P21 and P22. In other embodiments, different numbers of processors can be used and they can be assigned to sets in a variety of ways. In some embodiments, the set of processors associated with a core electronics component defines a true system partition with the conventional constraints regarding the processors, the operating system, and resources; in other embodiments, such constraints do not apply.


Core electronics component 11 includes loss-of-lockstep logic 21 and interface logic 31; similarly, core electronics component 12 includes loss-of-lockstep logic 22 and interface logic 32. Interface logics 31 and 32 provide for the conventional functions of core electronics components, e.g., interfacing processors with the rest of a system. When a set of processors is in lock-step mode, data from the processors is directed to the respective loss-of-lockstep logic (LOL) 21 or 22. Loss-of-lockstep logics initiate error-handling procedures when an error is detected as a loss of lockstep, e.g., when the associated processors provide different outputs for the same inputs. When the associated processors are in free-step mode, data from the processors bypasses the associated loss-of-lockstep logic and the core electronics component acts conventionally. Core electronics components 11 and 12, and in particular loss-of-lockstep logics 21 and 22, include inputs for controlling the step mode, e.g., as directed by operating system 44, per configuration instructions set up by the user of the system partition running that operating system.
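The routing of processor data through or around the loss-of-lockstep logic can be modeled roughly as follows. This is an illustrative software sketch only, assuming processor outputs are simple hashable values; the class and method names (CoreElectronicsComponent, set_mode, forward, LossOfLockstep) are hypothetical and do not describe the actual circuit.

```python
from enum import Enum
from typing import List, Sequence


class StepMode(Enum):
    LOCK_STEP = "lock-step"
    FREE_STEP = "free-step"


class LossOfLockstep(Exception):
    """Signals that processors in lock-step produced different outputs."""


class CoreElectronicsComponent:
    """Toy model of a component with a step-mode input and comparison logic."""

    def __init__(self, mode: StepMode = StepMode.FREE_STEP) -> None:
        self.mode = mode

    def set_mode(self, mode: StepMode) -> None:
        # Models the input used to control the step mode.
        self.mode = mode

    def forward(self, outputs: Sequence[int]) -> List[int]:
        """Return the data passed on to the rest of the system."""
        if self.mode is StepMode.FREE_STEP:
            # Free-step: comparison is bypassed; every output is passed on.
            return list(outputs)
        # Lock-step: outputs are compared before anything is passed on.
        if len(set(outputs)) != 1:
            raise LossOfLockstep(f"outputs differ: {list(outputs)}")
        return [outputs[0]]
```

In this sketch, a component in free-step mode forwards both outputs independently, while in lock-step mode it forwards a single agreed value or raises LossOfLockstep so that error handling can begin.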


In system AP1, core electronics components 11 and 12 are defined on respective substrates distinct from substrates bearing processors P11, P12, P21, P22. In alternative embodiments, various combinations of processors and core electronics components can be formed on common substrates.


As illustrated, core electronics components 11 and 12 interface with the rest of system AP1 via a bus 41. More generally, the interfacing can be done via a combination of buses or a network fabric. The rest of system AP1 includes, among other components, input-output channels 42 and memory 43. Memory 43 includes random access memory, hard disk storage, and other storage media.


Memory 43 stores an operating system 44, processes 45, a configuration database 46, and data 47. Database 46 contains basic lockstep configuration data including: 1) an indication of which CPUs are to be in lock-step mode and which are to be in free-step mode; 2) ‘rules’ under which this assignment can be dynamically changed; and 3) a list of processes that should run in lock-step mode. Processes are run in free-step mode by default unless the configuration database indicates otherwise.
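One way to picture the three kinds of configuration data is as a small record. The sketch below is an assumption made for illustration: the field names, the representation of rules as plain strings, and the process names "payroll" and "render" are all hypothetical, and no particular database format is implied.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class LockstepConfig:
    """Illustrative stand-in for the lockstep configuration data."""

    lock_step_cpus: Set[str]    # 1) CPUs to be in lock-step mode
    free_step_cpus: Set[str]    #    and CPUs to be in free-step mode
    reassignment_rules: List[str] = field(default_factory=list)  # 2) rules (opaque here)
    lock_step_processes: Set[str] = field(default_factory=set)   # 3) lock-step processes

    def mode_for_process(self, name: str) -> str:
        # Free-step is the default unless the process is listed for lock-step.
        return "lock-step" if name in self.lock_step_processes else "free-step"


config = LockstepConfig(
    lock_step_cpus={"P11", "P12"},
    free_step_cpus={"P21", "P22"},
    reassignment_rules=["hypothetical rule text"],
    lock_step_processes={"payroll"},
)
```

Here config.mode_for_process("payroll") yields "lock-step", while an unlisted process such as "render" falls back to the free-step default.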


Another embodiment could have this default setting reversed. Some of processes 45 belong to the operating system, while others belong to applications. When a process is called, operating system 44 checks configuration database 46 to determine whether to assign that process to a set of processors in lock-step mode or a processor in free-step mode. Some processes may require assignment to a lock-step mode, while others may require free-step mode. For other processes, database 46 can indicate criteria for assigning a process to one step mode or the other. For example, resource utilization and/or time of day data can be factors in determining whether to assign a process to a lock-step processor or a free-step processor.


When system AP1 is booted, operating system 44 checks database 46 for system configuration data to determine which processors are to be assigned to lock-step mode and which are to be assigned to free-step mode. In system AP1, this assignment is done on a set-by-set basis. While system AP1 is running, operating system 44 monitors resource utilization and, if configuration data permits, writes to database 46 to indicate an assignment of step modes to be assumed the next time system AP1 is booted. This feature is most useful for systems with many sets of processors.


If so configured, operating system 44 can also use resource utilization data to dynamically switch processor step modes. For example, when there is excess demand for data integrity, both processor sets can be set to lock-step mode. On the other hand, when greater performance is required and data integrity is less critical, both sets can be assigned to free-step mode.


Some of these capabilities are used in method M1, flow-charted in FIG. 2. Operating system 44 can call a process at method segment S11. At method segment S12, operating system 44 checks database 46 to determine the step mode for the called process. Lock-step mode may be required, free-step mode may be required, or some criteria can be specified for determining the step mode for the process. For example, lock-step mode may be favored at night, when utilization is relatively low and performance is less critical. Also, a process may prefer one mode but allow the other mode when resource utilization data favors the other mode. Depending on the outcome of the step-mode determination, the process can be assigned to a processor in lock-step mode at method segment S13 or to a processor in free-step mode at method segment S14.
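The step-mode determination of method segment S12 might be sketched as below. This is a hedged illustration, not the method itself: the Policy names, the night window (22:00 to 06:00), and the 0.9 "busy" threshold are assumptions chosen only to make the example concrete.

```python
from enum import Enum


class StepMode(Enum):
    LOCK_STEP = "lock-step"
    FREE_STEP = "free-step"


class Policy(Enum):
    REQUIRE_LOCK_STEP = 1
    REQUIRE_FREE_STEP = 2
    PREFER_LOCK_STEP = 3    # allows free-step when utilization favors it
    PREFER_FREE_STEP = 4    # allows lock-step when utilization favors it
    LOCK_STEP_AT_NIGHT = 5


def determine_step_mode(policy: Policy, hour: int, lock_util: float,
                        free_util: float, busy: float = 0.9) -> StepMode:
    """Pick a step mode for a called process (hypothetical criteria)."""
    if policy is Policy.REQUIRE_LOCK_STEP:
        return StepMode.LOCK_STEP
    if policy is Policy.REQUIRE_FREE_STEP:
        return StepMode.FREE_STEP
    if policy is Policy.LOCK_STEP_AT_NIGHT:
        is_night = hour >= 22 or hour < 6  # assumed night window
        return StepMode.LOCK_STEP if is_night else StepMode.FREE_STEP
    if policy is Policy.PREFER_LOCK_STEP:
        preferred, other = StepMode.LOCK_STEP, StepMode.FREE_STEP
        preferred_util, other_util = lock_util, free_util
    else:
        preferred, other = StepMode.FREE_STEP, StepMode.LOCK_STEP
        preferred_util, other_util = free_util, lock_util
    # Fall back to the other mode only when the preferred one is busy
    # and the other one still has headroom.
    if preferred_util >= busy and other_util < busy:
        return other
    return preferred
```

The returned mode would then select between assignment to a lock-step processor (S13) and assignment to a free-step processor (S14).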


Method segments S21-S24 operate concurrently with method segments S11-S14. In method segment S21, operating system 44 monitors resource allocation between processors in lock-step mode versus processors in free-step mode. At method segment S22, a determination is made whether the resource utilization is sufficiently balanced. If one mode is stressed, the processor sets can be reallocated at method segment S23; for example, operating system 44 can signal a core electronics component to place its set of processors in the selected step mode. If supply and demand for the step modes are reasonably balanced, the processor modes can be maintained at method segment S24. Depending on the system configuration, the reallocation can take effect upon restart or can be implemented dynamically. Alternatively, system AP1 can be configured to preclude reallocation of step modes.
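A single pass through segments S22 to S24 could look roughly like the following sketch. It assumes that utilization is summarized as one number per step mode and that "sufficiently balanced" means the two numbers differ by no more than a threshold; the function name, the 0.3 threshold, and the set labels "CEC11" and "CEC12" are hypothetical.

```python
from typing import Dict

LOCK_STEP = "lock-step"
FREE_STEP = "free-step"


def rebalance(modes: Dict[str, str], lock_util: float, free_util: float,
              imbalance: float = 0.3,
              allow_reallocation: bool = True) -> Dict[str, str]:
    """Return the step mode of each processor set after one monitoring pass."""
    new_modes = dict(modes)
    # S22: is utilization sufficiently balanced (or is reallocation precluded)?
    if not allow_reallocation or abs(lock_util - free_util) <= imbalance:
        return new_modes  # S24: maintain the current modes
    # S23: move one processor set from the relaxed mode to the stressed mode.
    if lock_util > free_util:
        stressed, relaxed = LOCK_STEP, FREE_STEP
    else:
        stressed, relaxed = FREE_STEP, LOCK_STEP
    donors = sorted(name for name, mode in modes.items() if mode == relaxed)
    if donors:
        new_modes[donors[0]] = stressed
    return new_modes
```

Whether the returned assignment is applied immediately or recorded for the next restart is left outside this sketch, since the text allows either.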


Reallocation method segments S21-S24 can interact with assignment method segments S11-S14. In particular, resource allocation data generated in method segment S21 can be used in mode assignment method segment S12 to determine which mode a process should be assigned to, e.g., when the configuration data for that process indicates the mode should favor less utilized resources. Likewise, assignment determinations at method segment S12 can be used as raw data in monitoring resource utilization at method segment S21.


The invention provides for embodiments with as few as two processors. In such a case, both processors can operate in lock-step mode or both can operate in free-step mode. Systems with three processors have more possibilities. One pair can operate in lock-step mode while the third operates in free-step mode. The pair can be fixed or formed from different combinations of processors. Also, all three can operate in lock-step mode in some circumstances. Greater numbers of processors offer more possibilities. In practice, the association of processors to core electronics components limits the combinations of processors that can be run together in lock-step. These and other variations upon and modifications to the described embodiment are provided for by the present invention, the scope of which is defined by the following claims.
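The possibilities for small systems can be counted with a short enumeration. The sketch below is a simplification made for illustration: it considers at most one lock-step group at a time, ignores any limits imposed by core electronics components, and uses the hypothetical processor names P1 to P3.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Tuple

Configuration = Tuple[FrozenSet[str], FrozenSet[str]]  # (lock-step group, free-step rest)


def single_group_configurations(processors: Iterable[str]) -> List[Configuration]:
    """List configurations with zero or one lock-step group of two or more processors."""
    everyone = frozenset(processors)
    ordered = sorted(everyone)
    configurations: List[Configuration] = [(frozenset(), everyone)]  # all free-step
    for size in range(2, len(ordered) + 1):
        for group in combinations(ordered, size):
            locked = frozenset(group)
            configurations.append((locked, everyone - locked))
    return configurations
```

Under these assumptions, two processors yield two configurations (both free-step, or both in lock-step), and three processors yield five: all free-step, three choices of lock-step pair with the third processor in free-step mode, and all three in lock-step.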

Claims
  • 1. A computer system comprising: a set of processors including a first processor for operating at least some of the time in lock-step mode; a second processor for operating at least some of the time in free-step mode; and allocating means for allocating processes to processors when said second processor is in said free-step mode.
  • 2. A computer system as recited in claim 1 wherein said allocating means causes said second processor to operate in said lock-step mode when said first processor operates in said lock-step mode.
  • 3. A computer system as recited in claim 1 further comprising a third processor that operates in said lock-step mode when said first processor operates in said lock-step mode.
  • 4. A computer system as recited in claim 1 wherein said allocating means switches at least one of said processors from one of said modes to the other.
  • 5. A computer system as recited in claim 4 further comprising resource allocation means for acquiring resource utilization data about processors in said lock-step mode and about processors in said free-step mode, said allocation means switching a processor from one of said modes to the other at least in part as a function of said resource utilization data.
  • 6. A computer system as recited in claim 1 further comprising computer readable media, said media storing a configuration database for storing information indicating which of said processors of said set are assigned to said lock-step mode and which are assigned to said free-step mode, said media also storing an operating system that serves as said allocating means, and indicating which of said processes should be assigned to processors assigned to said lockstep mode and which of said processes should be assigned to said free-step mode; said allocating means allocating said processes among said processors of said set as a function of said information.
  • 7. A computer system as recited in claim 6 wherein said information includes criteria to be applied in determining whether a process is to be assigned to said lock-step mode or said free-step mode.
  • 8. A computer system as recited in claim 6 wherein said allocating means is an operating system stored in said media.
  • 9. A method comprising operating a multiprocessor computer so that a first processor operates at least some of the time in lock-step mode and a second processor operates at least some of the time in free-step mode.
  • 10. A method as recited in claim 9 further comprising: an operating system accessing a configuration database for processor-mode information indicating which processes should be run in said lock-step mode and which processes should be run in said free-step mode, said operating system assigning processes to processors at least in part as a function of said process-mode information.
  • 11. A method as recited in claim 10 wherein said database holds processor-mode information indicating which processors are operating in said lock-step mode and which processors are operating in said free-step mode, said operating system assigning processes to processors at least in part as a function of said processor-mode information.
  • 12. A method as recited in claim 10 wherein said processor-mode information includes criteria to be applied in determining which step mode a process should run in.
  • 13. A method as recited in claim 12 wherein said criteria include resource utilization as a function of processor step mode.
  • 14. A method as recited in claim 9 further involving running said processors in the same mode at the same time.
  • 15. A method as recited in claim 14 further comprising switching said processor from one of said modes to the other.
  • 16. A method as recited in claim 9 further comprising operating a third processor in lock-step with said first processor when said first processor is in said lock-step mode.
  • 17. A method as recited in claim 9 further comprising: acquiring resource utilization data about a set of processors running in said lock-step mode and resource utilization about a set of processors running in said free-step mode; and switching at least one of said processors from one of said modes to the other at least in part as a function of said resource utilization data.
  • 18. A core electronics component comprising: an input for receiving a command to enter a lockstep mode and a command to enter a free-step mode; interface logic for interfacing a set of processors to other components of a host system; and loss of lockstep logic for determining, while in said lockstep mode, when a loss of lockstep occurs.
  • 19. A core electronics component as recited in claim 18 wherein said command to enter said free-step mode disables said loss of lockstep logic.
  • 20. A core electronics component as recited in claim 19 wherein in said free-step mode, outputs of said processors are input to said interface logic.
  • 21. A core electronics component as recited in claim 18 wherein said command to enter said lockstep mode causes outputs of said processors to be compared by said loss of lockstep logic.
  • 22. A core electronics component as recited in claim 21 wherein in said lockstep mode, outputs of said processors are input to said interface logic.