The present invention relates to implementing support for high-bandwidth random-access memory in a fault-tolerant computer. In particular, the present invention relates to methods and apparatus for supporting high-bandwidth memory in a synchronized multiprocessor computing environment.
Referring to
Typically, system bus 18 has a lower clock rate than CPU 12. Thus, computer performance may also be increased by increasing the clock speed of the bus 18, thereby increasing the throughput of communications carried by the bus 18. One implementation of a high-bandwidth integrated memory subsystem using a bus with a fast clock is the RAMBUS specification from RAMBUS, Inc. of Los Altos, Calif. The RAMBUS system uses a 400 MHz clock, with data transfers triggered on both the rising and falling edges of the clock signal. Therefore, one line in a RAMBUS channel has a bandwidth of 800 Mb/s.
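By way of illustration only, the per-line figure follows from the clock rate and the two transfers per clock cycle. The following is a minimal sketch of that arithmetic; the function name and its parameters are hypothetical and are not taken from the RAMBUS specification.

```python
def line_bandwidth_mbps(clock_mhz, transfers_per_cycle=2):
    """Return the bandwidth of one signal line in Mb/s.

    transfers_per_cycle is 2 when one bit is transferred on the rising
    edge and one on the falling edge of each clock cycle.
    """
    return clock_mhz * transfers_per_cycle


# 400 MHz clock, one transfer on each of the two edges per cycle.
print(line_bandwidth_mbps(400))
```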
In order to operate at this level of throughput, the operation of components in the RAMBUS subsystem is tightly monitored and periodically adjusted to maintain performance within predetermined tolerances. During these periodic recalibration events, the memory subsystem is not available for memory read or write transactions. In a single processor or asynchronous multiprocessor computer, this recalibration results in a short delay when the memory subsystem is unavailable. However, unsynchronized recalibration events can cause errors in a synchronized multiprocessor computing environment.
Certain prior art computer systems achieve fault tolerance through multiply-redundant system components. Each computer has multiple CPUs, each CPU having its own memory subsystem and other support electronics. The CPUs are cycle-synchronized to run identical copies of the same program simultaneously. Additional logic monitors the output of each CPU at a given point in time and, if the outputs disagree, restarts or initiates a diagnostic sequence to correct or identify the problem. If each CPU is equipped with a high-bandwidth memory subsystem that requires periodic recalibration, then the output of each individual CPU will appear to stall during a recalibration period. If recalibration among multiple memory subsystems is uncoordinated, then during recalibration events the outputs of the CPUs may vary, inducing monitor logic to halt or restart the system. Therefore, it is desirable to implement high-bandwidth memory in a lockstepped multiprocessor computing environment while avoiding delay-induced voter miscompares and other problems.
The present invention relates to the problem of implementing high-bandwidth memory in a multiprocessor computing environment. One object of the invention is to provide methods for synchronized recalibration among multiple hardware devices connected to a memory bus in a fault-tolerant computer. Another object of the invention is to provide a fault-tolerant computer with multiple integrated memory subsystems with synchronized memory recalibration.
In one aspect, the present invention is a method for providing synchronized recalibration of hardware devices on a memory bus in a fault-tolerant computing environment. A computer is provided with at least two central processing units (CPUs) and at least two hardware devices, each hardware device associated with one CPU and having a recalibration procedure with a non-zero duration. A deterministically-computed delay is used to simultaneously initiate recalibration among the hardware devices. In one embodiment, a maintenance clock signal is generated with a period substantially equal to the duration between recalibration cycles of the components of the memory subsystem and is used to trigger the deterministically-computed delay. In another embodiment, the initiation of the recalibration procedure occurs when a change in the maintenance clock signal changes the state of a reset signal, in turn initiating a deterministically-computed delay whose lapse initiates the recalibration procedure. In one embodiment, the change in the maintenance clock signal is an edge transition. In another embodiment, the change in the reset signal is a deassertion of the reset signal. In still another embodiment, the change in the reset signal is an assertion of the reset signal. In yet another embodiment, the deterministically-computed delay is an integer multiple of a system clock signal with a system clock period. In another embodiment, the hardware devices are RAMBUS memory controller hubs (MCH). In yet another embodiment, the hardware devices are RAMBUS memory repeater hubs (MRH).
In another aspect, the present invention is a fault-tolerant computer with synchronized memory recalibration. The computer includes at least two CPUs in synchronized operation and at least two hardware devices, each hardware device connected to a CPU through a memory bus and having a recalibration procedure. The computer also includes a synchronizer connected to the hardware devices and operating to synchronize the execution of the recalibration procedures among the hardware devices. In one embodiment, the hardware devices are MCHs. In another embodiment, the hardware devices are MRHs. In yet another embodiment, the computer includes a clock generator connected to the synchronizer, receiving a system clock signal and generating a maintenance clock signal to initiate the recalibration procedure in the hardware devices. In still another embodiment, the computer includes a temperature sensor connected to the synchronizer and thermally connected to the hardware devices, measuring their temperature. In yet another embodiment, the computer includes a current sensor connected to the synchronizer and electrically connected to the hardware devices, measuring the output current from the hardware devices.
These and other advantages of the invention may be more clearly understood with reference to the specification and the drawings, in which:
In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
In brief overview, Applicants' invention provides methods and apparatus for implementing high-bandwidth memory subsystems in a multiprocessor computing environment. Each component in the memory subsystem has a recalibration procedure. The computer provides a low-frequency clock signal with a period approximately equal to the duration between recalibration cycles of the components of the memory subsystem. Transitions in the low-frequency clock signal initiate a deterministically-computed delay. Lapse of the delay in turn triggers the recalibration of the components of the memory subsystem, ensuring synchronous recalibration. Synchronizing the recalibration events ensures that the memory subsystems are unavailable at the same time and for the same duration, consequently reducing voting errors between synchronized CPUs.
Although the present invention is discussed in terms of RAMBUS technologies, it is to be understood that the present invention encompasses embodiments using other high-bandwidth memory subsystems whose components require recalibration, including but not limited to double data rate synchronous dynamic RAM (DDR-SDRAM).
Referring to
The CPUs 12 are each connected to their own memory controller hub (MCH) 50 via a system bus 18. The MCH 50 operates as a bus master for the memory subsystem, generating requests, controlling the flow of data, and keeping track of RDRAM refresh and states. In a preferred embodiment, the MCH 50 is an 82840 Memory Controller Hub from INTEL CORPORATION of Santa Clara, Calif. The 82840 MCH directly supports dual channels of Direct RAMBUS memory using RAMBUS signaling level technologies. In this configuration, the MCH 50 does not require a memory repeater hub (MRH) 70. In another embodiment, the computer in accordance with the present invention uses DDR-SDRAM. In this case, the MCH 50 requires an MRH 70 to implement high-bandwidth memory.
The 82840 MCH provides a memory bandwidth of 3.2 GB/sec. To achieve this data rate, it is necessary to maintain RAMBUS channel parameters, such as device output current and temperature, within certain predetermined ranges. For example, in embodiments using the MRH 70, the MRH 70 requires periodic recalibration to maintain its temperature within a predetermined operating range. Periodically the temperature of the MRH 70 is measured and the slew rate of the output drivers of the MRH 70 is adjusted to correct any temperature drift. In one embodiment using the Intel 840 chipset, this recalibration procedure takes 350 nanoseconds, making the memory subsystem unavailable for read and write transactions during this period.
Since each recalibration period renders a memory subsystem unavailable for use, a CPU 12 can appear to halt operation while it waits for its memory subsystem to finish recalibration. If the CPUs of the computer are designed to operate in lockstep, then unsynchronized recalibrations among memory subsystems will cause individual processors to periodically stall, resulting in different output streams among the CPUs. If additional logic is present to compare the CPU outputs to facilitate error detection, then unsynchronized recalibrations will incorrectly appear as errors, although in principle the system may be operating correctly.
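By way of illustration only, the effect described above can be modeled in a few lines. The following is a minimal sketch, assuming a simplified model in which a CPU makes one unit of progress per cycle unless its memory subsystem is recalibrating; the function names and the cycle counts are hypothetical and do not correspond to the 350 nanosecond figure given above.

```python
def cpu_outputs(total_cycles, recal_start, recal_len):
    """Per-cycle output of one CPU in a simplified lockstep model.

    The CPU advances one step per cycle, except while its memory
    subsystem is recalibrating, when it stalls and repeats its output.
    """
    outputs = []
    progress = 0
    for cycle in range(total_cycles):
        stalled = recal_start <= cycle < recal_start + recal_len
        if not stalled:
            progress += 1
        outputs.append(progress)
    return outputs


def miscompares(streams):
    """Return the cycles at which the CPU outputs disagree."""
    return [c for c, outs in enumerate(zip(*streams)) if len(set(outs)) > 1]


# Both memory subsystems recalibrate during the same cycles.
synced = [cpu_outputs(12, 4, 3), cpu_outputs(12, 4, 3)]
# The second memory subsystem recalibrates two cycles later.
unsynced = [cpu_outputs(12, 4, 3), cpu_outputs(12, 6, 3)]

print(miscompares(synced))
print(miscompares(unsynced))
```

In this model the synchronized case yields no disagreement, while the unsynchronized case yields disagreement on several cycles even though neither CPU has actually failed.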
This problem is addressed by the addition of a synchronizer 76 to generate signals to initiate the recalibration of the memory subsystem components in a controlled fashion. The synchronizer 76 receives a low-frequency maintenance clock signal. The period of the maintenance clock signal is substantially equal to the duration between recalibration cycles of the components of the integrated memory subsystem. In some embodiments, the maintenance clock signal has a period that is an integer multiple of a higher-frequency system clock signal.
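By way of illustration only, a maintenance clock whose period is an integer multiple of the system clock period can be modeled as a simple divider. The following is a minimal sketch; the function names, the 50% duty cycle, and the divisor of 8 are assumptions chosen for the example and are not taken from the embodiments described herein.

```python
def maintenance_clock(system_ticks, divisor):
    """Model a slow square wave derived from a system clock.

    The output has a period of `divisor` system clock ticks and is high
    for the first half of each period (an assumed 50% duty cycle).
    """
    if divisor < 2 or divisor % 2:
        raise ValueError("divisor must be an even integer >= 2")
    half = divisor // 2
    return [1 if (t % divisor) < half else 0 for t in range(system_ticks)]


def rising_edges(signal):
    """Return the tick indices at which the signal goes from 0 to 1."""
    return [i for i in range(1, len(signal))
            if signal[i - 1] == 0 and signal[i] == 1]


clk = maintenance_clock(24, 8)
print(rising_edges(clk))
```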
Transitions in the maintenance clock signal initiate a predetermined deterministically-computed delay. Lapse of that delay in turn initiates the assertion of a reset signal routed to the MCH 50 and in turn to the CPUs 12, initiating a coordinated recalibration among the components of the integrated memory subsystems. Coordinated memory recalibration ensures that the integrated memory subsystems are simultaneously unavailable for the same amount of time, minimizing the disruption in a synchronous multiprocessor computing environment.
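By way of illustration only, the coordination described above can be sketched as a small simulation. The following is a simplified model, not the circuit of synchronizer 76: the class names, the delay of three system clock cycles, and the maintenance clock period of eight cycles are all hypothetical, and the intermediate reset signal is omitted so that the lapse of the delay starts recalibration directly.

```python
from dataclasses import dataclass, field


@dataclass
class MemorySubsystem:
    """Hypothetical stand-in for the memory subsystem of one CPU."""
    name: str
    recal_cycles: list = field(default_factory=list)

    def recalibrate(self, cycle):
        # Record the system clock cycle at which recalibration started.
        self.recal_cycles.append(cycle)


class Synchronizer:
    """Starts recalibration in every subsystem a fixed number of system
    clock cycles after each maintenance clock transition."""

    def __init__(self, subsystems, delay_cycles):
        self.subsystems = subsystems
        self.delay_cycles = delay_cycles
        self.due = set()  # cycles at which a recalibration must start

    def tick(self, cycle, maintenance_edge):
        if maintenance_edge:
            self.due.add(cycle + self.delay_cycles)
        if cycle in self.due:
            self.due.discard(cycle)
            for subsystem in self.subsystems:
                subsystem.recalibrate(cycle)


subsystems = [MemorySubsystem("A"), MemorySubsystem("B")]
sync = Synchronizer(subsystems, delay_cycles=3)
for cycle in range(24):
    # Assumed maintenance clock: one transition every 8 system cycles.
    sync.tick(cycle, maintenance_edge=(cycle > 0 and cycle % 8 == 0))

for subsystem in subsystems:
    print(subsystem.name, subsystem.recal_cycles)
```

Because the delay is counted in whole system clock cycles from a shared transition, every subsystem in the model starts recalibration on the same cycle.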
The RESET signal is asserted under two conditions. First, the RESET signal is asserted in the event of power-on or a system restart after the passage of a predetermined delay determined by timer 90. Second, the RESET signal is asserted when a signal is asserted on the line to the control interface 80. The control interface line may be asserted by hardware or software to initiate a synchronized recalibration of the components of the integrated memory subsystems. In one embodiment, an individual hardware device initiates a recalibration event in all of the integrated memory subsystems by placing a signal on this line.
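By way of illustration only, the two conditions above can be written as a single boolean expression. The following is a minimal sketch; the function and parameter names are hypothetical, and the lapse of the delay determined by timer 90 is reduced to a flag.

```python
def reset_asserted(power_on_or_restart, timer_expired, control_line_asserted):
    """Return True when RESET should be asserted.

    Condition 1: a power-on or system restart has occurred and the
    predetermined timer delay has lapsed.
    Condition 2: the control interface line is asserted by hardware
    or software.
    """
    return (power_on_or_restart and timer_expired) or control_line_asserted


# Restart in progress, timer not yet lapsed, control line idle.
print(reset_asserted(True, False, False))
# Control line asserted during normal operation.
print(reset_asserted(False, False, True))
```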
After the next RESET signal is asserted (Step 22), a deterministic delay passes before the recalibration cycle in the next memory subsystem is initiated (Step 24). At this point, the recalibration cycles between the memory subsystems have been synchronized. The process repeats itself for the remaining CPUs (Step 26) before normal system operation ensues (Step 28). In normal operation, recalibration among memory subsystems continues to operate synchronously.
Many alterations and modifications may be made by those having ordinary skill in the art without departing from the spirit and scope of the invention. Therefore, it must be expressly understood that the illustrated embodiment has been shown only for the purposes of example and should not be taken as limiting the invention, which is defined by the following claims. The following claims are thus to be read as not only literally including what is set forth by the claims but also to include all equivalent elements for performing substantially the same function in substantially the same way to obtain substantially the same result, even though not identical in other respects to what is shown and described in the above illustrations.
Number | Date | Country |
---|---|---|
20020124202 A1 | Sep 2002 | US |