The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
We implement our invention with concurrent selftest hardware provided with the system, which consists of two major pieces of hardware: the selftest engine and the priority logic. When concurrent selftest is needed, the hardware selftest engine is first set up by firmware. Generally, the starting and ending addresses, address mode, and data mode are initialized. After the firmware setup, the selftest engine starts sending fetch and store commands to the priority logic in the background. The priority logic takes the commands from the selftest engine and the regular mainline traffic, prioritizes them, and sends them sequentially to the Processor Memory Arrays (PMA) section of the memory sub-system.
Turning now to the drawings in greater detail, it will be seen that in
In the z9-109 implementation, the MSC (Main Storage Controller) chip has an X port and a Y port, each independently controlling a PMA. Within the hardware we have provided a plurality of ports for a memory region of the global system storage; each of these ports has a concurrent selftest engine which is assigned to test a memory region within a set of DRAMs on the PMA to which it is assigned. The X and Y ports of the Controller operate independently, and the engines in both ports can operate in parallel. There are two MSC chips to a node, and both MSC chips in a node can operate in parallel, as can the 4 nodes in a system. That adds up to 16 selftest engines (2 ports per MSC chip, 2 MSC chips per node, 4 nodes per system), each allocated to a memory region and running concurrently to quickly verify the quality of the pre-allocated extended memory. See the illustrations described below with respect to the Figures for the selftest hardware engines in a system.
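The engine count can be checked with a short enumeration. This is only an illustrative sketch: the constants come from the description above, while the function name `enumerate_engines` and the tuple layout are assumptions made for the example.

```python
from itertools import product

NODES = 4            # nodes per system (from the description)
MSC_PER_NODE = 2     # MSC chips per node (from the description)
PORTS = ("X", "Y")   # ports per MSC chip (from the description)


def enumerate_engines():
    """Return one (node, msc, port) tuple per concurrent selftest engine."""
    return list(product(range(NODES), range(MSC_PER_NODE), PORTS))


engines = enumerate_engines()
print(len(engines))  # 16
```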
Concurrent Selftest Engine
Below are detailed explanations of each of the components. The concurrent selftest engine is the core of the hardware employed to test and repair memory and to dynamically allocate or de-allocate memory regions in response to customer demand. Once set up, it generates memory fetch and store commands and sends them to the priority logic. For memory stores, the selftest engine can use fixed or random data patterns. For memory fetches, the hardware memory selftest engine (in a manner different from the selftest engine of U.S. Pat. No. 5,003,048) checks the data validity either by bit comparison or by ECC checking, and updates the selftest status based on the results.
The firmware implements the following setup parameters which are used in a hardware concurrent selftest engine. The settings are divided into 4 categories.
Address Control Parameters
1. Starting Address
It defines the starting address of the extended memory region that the hardware concurrent selftest engine will be working on.
2. Ending Address
It defines the ending address of the extended memory region that the hardware concurrent selftest engine will be working on.
3. LICCC (Licensed Internal Code Configuration Control) Address
It defines the upper limit of the customer address space. It is used as a control to prevent any selftest accesses from entering the customer's address range. Any setup error or internal control error results in a specification error status being posted to the firmware.
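The address check described for this group of parameters can be sketched as follows. This is a hedged illustration, not the hardware logic: the names `AddressControl`, `SpecificationError`, and `validate` are hypothetical, and the sketch assumes that customer storage occupies the addresses strictly below the LICCC address and that the ending address is inclusive.

```python
from dataclasses import dataclass


class SpecificationError(Exception):
    """Hypothetical stand-in for the specification error status posted to firmware."""


@dataclass(frozen=True)
class AddressControl:
    start: int   # Starting Address of the extended memory region
    end: int     # Ending Address of the region (assumed inclusive)
    liccc: int   # LICCC Address: upper limit of the customer address space


def validate(ctl):
    """Raise SpecificationError if the selftest range is malformed or unsafe."""
    if ctl.start > ctl.end:
        raise SpecificationError("starting address is above ending address")
    # Assumption: addresses below ctl.liccc belong to the customer.
    if ctl.start < ctl.liccc:
        raise SpecificationError("selftest range enters the customer address range")
    return True
```

With these assumptions, `validate(AddressControl(start=0x8000, end=0xFFFF, liccc=0x8000))` passes, while a range starting at `0x7000` is rejected.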
Data Control Parameters
4. Data Generation Mode
For selftest writes, the firmware setup controls the data and requires it to be either a fixed data pattern or a random data pattern. In fixed data pattern mode, the data generated is taken from the Data Pattern parameter. In random data pattern mode, the data is calculated by a random data generator.
5. Data ECC Mode: ECC/Compare Mode
The firmware defines the way data is sent to or returned from memory. In ECC mode, the data is transferred along with an ECC code, and on a fetch operation the fetch ECC station checks the ECC results. In compare mode, the data is transferred as 144-bit data without ECC, and on a fetch operation the data is compared against a known data pattern to verify its validity.
6. Data Pattern
The data control parameter holds the implemented data pattern. It is used in fixed data pattern mode and also used as the starting point by the random data generator in random data mode.
7. Random Data Generation Mask
The random data generation mask is used by the random data generator to generate random data patterns.
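The two data generation modes can be illustrated with a small sketch. The description does not specify how the random data generator works, so the Galois linear-feedback shift register used here is purely an assumption, chosen because it naturally takes a starting value (the Data Pattern) and a mask (treated here as the feedback taps). The names `next_random` and `data_stream` are hypothetical.

```python
WIDTH = 144                 # data width in compare mode (from the description)
FULL = (1 << WIDTH) - 1     # keeps values within 144 bits


def next_random(state, mask):
    """One Galois-LFSR step (assumed generator): shift right and XOR in
    the mask whenever the bit shifted out was 1."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= mask
    return state & FULL


def data_stream(mode, pattern, mask=0, count=4):
    """Return `count` store-data words for the given Data Generation Mode."""
    if mode == "fixed":
        # Fixed mode: every store uses the Data Pattern parameter as-is.
        return [pattern & FULL] * count
    if mode == "random":
        # Random mode: the Data Pattern is only the starting point.
        out, state = [], pattern & FULL
        for _ in range(count):
            state = next_random(state, mask)
            out.append(state)
        return out
    raise ValueError(f"unknown data generation mode: {mode!r}")
```

Because the sequence is fully determined by the pattern and the mask, a checker in compare mode could regenerate the expected data on fetch, which is one plausible reason for making both values firmware-visible parameters.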
Operation Sequence Control Parameters
8. Gap Control
The firmware gap control is used to introduce artificial gaps between the commands that the hardware memory selftest engine sends to memory. The gap determines how much of the overall memory data bandwidth the engine uses. It therefore affects system performance, since the concurrent selftest engine shares the same memory and memory ports with the mainline function. In concurrent mode, the speed of completing the test is generally not a factor. Thus, the gap is generally set fairly large to limit the data bandwidth usage.
9. Start/Stop Bits
The start/stop bits are the main switch for turning the selftest engine on and off.
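The effect of the gap control on bandwidth can be made concrete with a simple model. This sketch assumes the gap is counted in command slots and that the engine issues one command and then idles for that many slots; the description does not state the unit of the gap, and the function names are hypothetical.

```python
def selftest_bandwidth_share(gap_slots):
    """Upper bound on the fraction of command slots the engine can use,
    assuming one command followed by `gap_slots` idle slots."""
    if gap_slots < 0:
        raise ValueError("gap must be non-negative")
    return 1.0 / (1 + gap_slots)


def issue_slots(gap_slots, total_slots):
    """Slot indices at which the engine would issue a command."""
    return list(range(0, total_slots, gap_slots + 1))


print(selftest_bandwidth_share(3))   # 0.25
print(issue_slots(3, 10))            # [0, 4, 8]
```

Under this model, a large gap leaves most slots available to mainline traffic, which matches the guidance that the gap is generally set fairly large in concurrent mode.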
Status and Error Reporting Registers
10. Status Register
A status register stores the current status of the concurrent selftest engine and the overall testing results. Firmware can poll this register periodically to watch the selftest progress and check the overall selftest results.
11. Bit Error Counters
Each data bit has a corresponding bit error counter that keeps track of how many errors have occurred on that bit during the memory selftest. During the concurrent selftest in compare mode, should miscompares occur, the selftest engine increments the count for each corresponding bit. In ECC mode, the counters also increment when a correctable error (CE) is detected in the data.
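The compare-mode bookkeeping can be sketched in a few lines. This is a software illustration under stated assumptions, not the hardware design: the function name `record_miscompare` and the list-of-integers counter layout are hypothetical, and only the 144-bit width comes from the description.

```python
WIDTH = 144  # compare-mode data width (from the description)


def record_miscompare(counters, expected, fetched):
    """XOR the fetched data against the expected pattern and increment the
    counter of every bit position that differs. Returns the failing bits."""
    diff = (expected ^ fetched) & ((1 << WIDTH) - 1)
    failing = [bit for bit in range(WIDTH) if (diff >> bit) & 1]
    for bit in failing:
        counters[bit] += 1
    return failing


counters = [0] * WIDTH
print(record_miscompare(counters, 0b1010, 0b1000))  # [1]
print(counters[1])                                  # 1
```

Per-bit counts of this kind let firmware see whether errors cluster on particular bit positions, which is the sort of information a DRAM sparing decision would need.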
Priority Logic
The main function of the hardware priority logic is to merge the memory command stream from the selftest engine with the mainline memory command stream. The priority logic can be programmed to treat the selftest commands with normal priority or lower priority.
In normal priority mode, the priority logic treats selftest commands and mainline commands in the same manner. The commands are executed based only on the availability of the DRAM banks. The memory bandwidth used by selftest commands is mainly controlled by the 'gap control' parameter of the selftest engine.
In low priority mode, the priority logic gives the selftest commands lower weight than the regular mainline commands. A selftest command is executed only if there are no outstanding mainline commands pending. This minimizes the performance impact that concurrent selftest would otherwise have on the mainline memory operations.
The other function of the priority logic hardware is to handle memory bank/rank conflicts. Traditionally, all the incoming mainline commands target different memory banks by design. However, the memory selftest commands we have added in the background could target a memory bank that is currently being used by regular mainline commands. When such a conflict occurs, the priority logic delays sending out the later-arriving command until its target memory bank is freed.
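The arbitration rules above can be sketched as a single selection step. This is a simplified model under stated assumptions: each queue holds only target bank numbers, one command is selected per step, and in normal priority mode a tie between two ready commands goes to the mainline command (the description does not specify the tie-break). The function name `select` is hypothetical.

```python
from collections import deque


def select(mainline, selftest, busy_banks, low_priority):
    """Pick the next command to send to the PMA, or None if nothing can go.

    `mainline` and `selftest` are deques of target bank numbers;
    `busy_banks` is the set of banks currently in use.
    """
    def ready(queue):
        # A command is held back while its target bank is busy.
        return bool(queue) and queue[0] not in busy_banks

    if low_priority:
        # Selftest runs only when no mainline command is outstanding.
        if mainline:
            return ("mainline", mainline.popleft()) if ready(mainline) else None
        return ("selftest", selftest.popleft()) if ready(selftest) else None

    # Normal priority: bank availability alone decides.
    # Assumption: mainline wins when both heads are ready.
    for name, queue in (("mainline", mainline), ("selftest", selftest)):
        if ready(queue):
            return (name, queue.popleft())
    return None
```

For example, with a mainline command waiting on busy bank 2 and a selftest command for free bank 5, normal priority mode issues the selftest command, while low priority mode issues nothing until the mainline command has gone out.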
Firmware
Firmware is the driving force for the concurrent selftest. When such a selftest is needed, firmware first sets up the selftest engine with the parameters detailed in the above section. Once the concurrent selftest is initiated, the hardware memory selftest engines on all memory ports run in parallel. The firmware periodically polls the selftest status. Once every engine has finished the tests on its own memory port, the firmware retrieves all the error status information and takes the indicated actions based on the results, e.g., sparing the DRAM chips and other operations.
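The firmware flow just described (set up, start, poll, collect results) can be sketched as a polling loop. Everything named here is hypothetical: `FakeEngine` is a software stand-in for a hardware engine, and `run_concurrent_selftest` is not an actual firmware interface. The sketch only illustrates the control flow.

```python
import time


class FakeEngine:
    """Hypothetical stand-in for one hardware selftest engine."""

    def __init__(self, polls_until_done, errors=0):
        self._polls_left = polls_until_done
        self.errors = errors      # error count reported when the test ends
        self.running = False

    def setup_and_start(self):
        # Real firmware would write the setup parameters, then the start bit.
        self.running = True

    def poll_done(self):
        # Models reading the status register; finishes after a few polls.
        if self._polls_left > 0:
            self._polls_left -= 1
        return self._polls_left == 0


def run_concurrent_selftest(engines, poll_interval=0.0):
    """Start all engines, poll until all finish, and return the indices of
    engines that reported errors (candidates for actions such as sparing)."""
    for engine in engines:
        engine.setup_and_start()
    pending = set(range(len(engines)))
    while pending:
        pending = {i for i in pending if not engines[i].poll_done()}
        if pending:
            time.sleep(poll_interval)
    return [i for i, engine in enumerate(engines) if engine.errors > 0]


print(run_concurrent_selftest([FakeEngine(1), FakeEngine(3, errors=2)]))  # [1]
```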
Applications
The central storage regions can be categorized as follows. The selftest engine mainly works on the inactive regions and the unassigned regions, once the system storage configuration is changed on demand by the customer.
Performance improves substantially because the activities are done by hardware and no firmware code is involved during the selftest execution. The concurrent selftest engine can be used for the following scenarios:
1. Concurrently Verify/Test the Newly Allocated Memory Region
(Once the new memory is allocated, the concurrent selftest is performed to verify whether the memory has any defects. This has a performance advantage over the existing implementation.)
2. Concurrently Initialize the Newly Allocated Memory Region per Architecture.
(Once the newly allocated memory has tested defect-free, the memory needs to be initialized with a certain data pattern before being turned over to customer usage. The data pattern is determined per system architecture.)
3. Concurrently Clear an Unused Memory Region Whose Application Is No Longer Active.
(For data security reasons, the concurrent selftest can be used to clear a chunk of memory with a fixed data pattern, thus erasing all the leftover customer information.)
4. Concurrently Scramble an Unused Active Memory Region
(This new capability is useful for data security. The concurrent selftest can be used to overwrite a chunk of memory with a random data pattern, thus erasing all the leftover customer information.)
The capabilities of the present invention can be implemented in software, firmware, hardware, or a combination thereof using the hardware memory selftest engine.
As one example, one or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media for implementing the invention. The media has embodied therein, for instance, computer readable program code means for providing and facilitating the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. These steps can be provided as a service to the customer. All of these variations are considered a part of the claimed invention.
While the preferred embodiment of the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.