Embodiments of the invention relate to the field of microprocessor architecture. More particularly, embodiments of the invention relate to various techniques for writing data from a bus agent to a processor cache without having to first write the data to memory and then having the processor read the data from the memory.
Typically, bus agents residing in a computer system have had to first write (“push”) data to a location in a memory device external to the processor or processors for which the data is intended, such as a dynamic random access memory (DRAM) location in main memory or a static RAM (SRAM) location in a level-2 (L2) cache. The target processor or processors would then have to read the data from the memory location, incurring read cycles that can hamper processor and system performance.
The target processor(s) typically store the retrieved data into an internal cache within the processor, such as a level-1 (L1) cache. Prior art techniques have, therefore, been developed to write the target data from the external agent to the processor's internal cache directly (i.e., without the data first being written to memory and later retrieved by the target processor). In multi-processor systems, it may be necessary for cache coherency to be maintained among the processors in the system.
Prior art techniques have been developed to address the coherency problem for multi-processor systems by, for example, specifying a fixed target processor address that is encoded by the pushing agent and driven onto the interconnect between the external agent and the target processor(s), dynamically selecting the target processor(s) at the external agent, or simply treating all processors in the system as targets such that the data is always written to each processor's internal cache. However, prior art techniques require the external agent, or “pushing” agent, to be aware of such things as how many processors are in the system at any given time, how to address each processor, etc.
The prior art methods for directly writing data to a processor or processors while maintaining cache coherency between the processors may not provide the best solution in several kinds of applications: those using symmetric processing, in which push data may not be associated with a specific processor; dynamically configurable systems, in which the processor resources may change in number and/or address; and other applications in which sufficient information about the processors in the system may not be available and/or it is not desirable to write data to all processors in the system. In general, prior art techniques for writing data directly from a bus agent to a processor's internal cache while maintaining cache coherency with other processors or agents within the system have been largely push agent-focused, in that it is the responsibility of the writing bus agent to maintain coherency among the target processors or agents.
Requiring the push agent to maintain coherency can limit the number of applications in which direct data pushing techniques, such as those previously discussed, can be used.
Embodiments of the invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
Embodiments of the invention described herein relate to multi-processor systems. More particularly, embodiments of the invention described herein relate to techniques to write data from a bus agent within a multi-processor computer system to one or more processors within the system without having to first write the data to a memory location external to the target processor(s) from which the data may be retrieved by the target processor(s). Furthermore, embodiments of the invention described herein relate to techniques for pushing data from a bus agent to at least one processor within a multi-processor system, in which the processor(s) are at least partially responsible for arbitrating the target of the push data and for maintaining cache coherency between the various processors within the system.
As multi-processor systems become more complex and diverse, decentralizing the arbitration of push data and cache coherency becomes increasingly important in direct-push system architectures. Embodiments described herein may be used in any number of multi-processor system configurations, including those of the prior art, while allowing for greater flexibility in the design of these systems. Two general computer system architectures are described in this disclosure by way of example: a shared bus architecture (or “front-side bus” architecture) and a point-to-point (PtP) system architecture. However, embodiments of the invention are not limited to these computer systems, and may be readily used in any number of multi-processor computer systems in which data is pushed directly to the processor(s) within the system rather than first being stored to a memory external to the processors from which the processor(s) may retrieve the data.
Illustrated within the processor of
The main memory may be implemented in various memory sources, such as dynamic random-access memory (DRAM), a hard disk drive (HDD) 320, or a memory source located remotely from the computer system via network interface 330 containing various storage devices and technologies. The cache memory may be located either within the processor or in close proximity to the processor, such as on the processor's local bus 307. Furthermore, the cache memory may contain relatively fast memory cells, such as a six-transistor (6T) cell, or other memory cell of approximately equal or faster access speed.
The computer system of
The system of
At least one embodiment of the invention may be located within the PtP interface circuits within each of the PtP bus agents of
Processors 507 and 510 respond with a signal indicating that they are candidates to receive the push data from the pushing agent. In other embodiments one, none, or more processors may respond. In shared bus systems, a processor may respond to the push request by driving a “push target candidate” (PTC) signal on the bus during a shared bus signaling phase, such as a response phase. In a PtP computer system, a processor may respond to the push request by issuing a PTC message from the processor to the pushing agent or push arbiter.
The decision as to whether a processor responds may be based on a number of criteria. For example, a processor may respond based on whether it already has the push data cached, whether it has enough resources, such as buffer and/or queue space, available to process the push request, whether the push request matches a push address range designated within the processor, or whether there are competing accesses for shared cache, buffer, or queue resources. In other embodiments, other criteria may determine whether a processor responds as a candidate to receive the push data, including whether accepting the data will cause data within the processor's cache to be replaced.
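By way of illustration only, the candidate decision described above might be modeled in software as follows. This is a minimal sketch, not a description of any particular embodiment: the `ProcessorState` fields, the function name `is_push_target_candidate`, and in particular the way the criteria are combined (resource checks act as vetoes, then a cached copy or an address-range match qualifies the processor) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessorState:
    """Hypothetical snapshot of the state a processor consults on a push request."""
    cached_lines: set = field(default_factory=set)  # addresses currently cached
    free_buffer_slots: int = 0                      # buffer/queue space available
    push_start: int = 0                             # designated push range (inclusive)
    push_end: int = 0                               # designated push range (exclusive)
    shared_resource_busy: bool = False              # competing accesses in flight


def is_push_target_candidate(state: ProcessorState, address: int) -> bool:
    """Return True if the processor would respond with a PTC for this address.

    Assumed combination of criteria: lack of resources or contention vetoes
    the response; otherwise an already-cached copy or a match against the
    designated push address range makes the processor a candidate.
    """
    if state.free_buffer_slots <= 0:
        return False  # no buffer/queue space to process the push
    if state.shared_resource_busy:
        return False  # competing accesses for shared resources
    if address in state.cached_lines:
        return True   # push data is already cached here
    return state.push_start <= address < state.push_end
```

Other embodiments could weigh the same inputs differently, for example by declining when accepting the data would force a replacement in the cache.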
Once each responding processor has indicated that it is a candidate to receive the push data, the choice of the processor(s) to which the data is to be sent is arbitrated. In one embodiment, the push arbitration is done by the push agent itself. In other embodiments, the push arbitration is done by a separate push arbiter or within one or more of the processors. In yet other embodiments, the arbitration may be distributed throughout the pushing agent, the processors, and/or a push arbiter. Any arbitration scheme may be used in determining the appropriate recipient processor(s) for the push data. For example, in one embodiment, the arbitration scheme is a “round-robin” scheme, in which each processor in the system receives data in a particular order. Alternatively, a static priority arbitration scheme may be used, in which a priority among the processors is maintained for each push. In still other embodiments, other arbitration schemes may be used.
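The two arbitration schemes named above can be sketched as follows. The class and function names (`RoundRobinArbiter`, `static_priority_select`) and the use of integer processor identifiers are hypothetical; the sketch selects a single recipient per push and is only one way such schemes might be realized.

```python
class RoundRobinArbiter:
    """Rotates through a fixed processor order, skipping non-candidates."""

    def __init__(self, processor_ids):
        self._order = list(processor_ids)
        self._next = 0  # position in the order where the next search starts

    def select(self, candidates):
        """Return the first candidate at or after the rotating pointer."""
        n = len(self._order)
        for offset in range(n):
            idx = (self._next + offset) % n
            pid = self._order[idx]
            if pid in candidates:
                self._next = (idx + 1) % n  # advance past the chosen processor
                return pid
        return None  # no processor responded as a candidate


def static_priority_select(priority_order, candidates):
    """Return the highest-priority candidate; the order is the same for every push."""
    for pid in priority_order:
        if pid in candidates:
            return pid
    return None
```

With four processors and candidates 1 and 2 responding to three consecutive pushes, the round-robin arbiter in this sketch alternates between them, whereas the static priority function returns the same processor each time.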
In the event that no processors respond as candidates to receive the push data, various embodiments of the invention may use varying techniques to deal with this situation. For example, in one embodiment of the invention, at least one processor is guaranteed to respond as a candidate to receive the data. In another embodiment, the pushing agent or push arbiter chooses one of the processors to accept the data. In another embodiment, the default recipient is always the memory controller, which can then write the push data to memory external to the processor(s), such as DRAM. However, in other embodiments, the push may simply be aborted. Other arbitration schemes may be used in other embodiments in the event that no processor responds as a candidate to receive the push data.
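The alternatives for handling a push to which no processor responds might be expressed as a configurable policy, as in the sketch below. The enumeration `NoCandidatePolicy`, the constant `MEMORY_CONTROLLER_ID`, and the function `resolve_target` are hypothetical names; picking the first candidate or the first processor stands in for whatever arbitration scheme an embodiment actually uses.

```python
from enum import Enum


class NoCandidatePolicy(Enum):
    """Assumed labels for the fallback behaviors described in the text."""
    ARBITER_CHOOSES = "arbiter_chooses"      # arbiter picks a processor anyway
    MEMORY_CONTROLLER = "memory_controller"  # data goes to external memory
    ABORT = "abort"                          # the push is dropped


MEMORY_CONTROLLER_ID = "memory_controller"  # hypothetical identifier


def resolve_target(candidates, all_processors, policy):
    """Return the recipient of the push, or None if the push is aborted."""
    if candidates:
        return candidates[0]  # placeholder for the normal arbitration step
    if policy is NoCandidatePolicy.ARBITER_CHOOSES:
        return all_processors[0]  # placeholder choice among all processors
    if policy is NoCandidatePolicy.MEMORY_CONTROLLER:
        return MEMORY_CONTROLLER_ID
    return None  # ABORT
```

The embodiment in which at least one processor is guaranteed to respond needs no fallback and is therefore not represented in the policy.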
In one embodiment of the invention, the processor(s) to receive the push data is notified by a signal from the pushing agent. In a shared bus system, the notification may be done by driving a “selected push target” (SPT) signal during a bus signaling phase, such as during a response phase. In a PtP system, the SPT message may be sent by the pushing agent or some other arbiter agent to the receiving processor(s). In other embodiments, no such notification is given to the receiving processor(s) and the data is simply delivered.
After the determination of the recipient processor(s) is made, the push data may be delivered to the recipient processor(s) and the non-recipient processor(s) may invalidate any prior copies of the data they may have in their internal caches. The recipient processor(s) receives the data from the pushing agent and stores it in its cache, overwriting any existing copy of the data.
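The delivery and invalidation step can be illustrated with a simple model in which each processor's internal cache is a dictionary mapping addresses to data. The function name `deliver_push` and this representation of a cache are assumptions made for the example only.

```python
def deliver_push(caches, recipients, address, data):
    """Deliver push data to recipients and invalidate stale copies elsewhere.

    caches:     dict mapping processor id -> dict of {address: data}
    recipients: set of processor ids selected to receive the push
    """
    for pid, cache in caches.items():
        if pid in recipients:
            cache[address] = data     # store, overwriting any existing copy
        else:
            cache.pop(address, None)  # invalidate a prior copy, if any
```

After the call, only the recipient caches hold the address, so no processor in the model can observe a stale copy of the pushed data.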
In some applications in which embodiments of the invention may be used, the recipient processor(s) may require that the push data not be modified by subsequent cache write operations. Cached data within a processor may be replaced according to algorithms such as a “least-recently used” (LRU) algorithm, a not-recently used (NRU) algorithm, or a round-robin algorithm. Accordingly, at least one embodiment of the invention supports a command or other signal that may be issued along with the push data to prevent subsequent writes to the location to which the push data is written (i.e., to “lock” the memory location). Other processors that did not receive the data from the push agent may arbitrate with the processor(s) that did receive the data in order to access the data.
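One possible reading of the lock behavior is sketched below using an LRU-managed cache. The class `LockableLRUCache` and its methods are hypothetical, and the sketch assumes that a locked line is both protected from ordinary writes and skipped by the replacement algorithm until it is explicitly unlocked; the text does not specify how or when a lock is released.

```python
from collections import OrderedDict


class LockableLRUCache:
    """Toy LRU cache in which pushed lines may be locked against modification."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._lines = OrderedDict()  # address -> data, least recently used first
        self._locked = set()         # addresses protected by a lock

    def push(self, address, data, lock=False):
        """Store push data, overwriting any existing copy; optionally lock it."""
        self._store(address, data)
        if lock:
            self._locked.add(address)

    def write(self, address, data):
        """Ordinary cache write; returns False if the line is locked."""
        if address in self._locked:
            return False
        self._store(address, data)
        return True

    def read(self, address):
        """Return the cached data, or None on a miss."""
        if address in self._lines:
            self._lines.move_to_end(address)  # mark as most recently used
        return self._lines.get(address)

    def unlock(self, address):
        self._locked.discard(address)

    def _store(self, address, data):
        if address not in self._lines and len(self._lines) >= self.capacity:
            self._evict()
        self._lines[address] = data
        self._lines.move_to_end(address)

    def _evict(self):
        # Replace the least recently used line that is not locked.
        victim = next((a for a in self._lines if a not in self._locked), None)
        if victim is None:
            raise RuntimeError("all cache lines are locked")
        del self._lines[victim]
```

In this sketch a locked push line survives even when it is the least recently used entry, and a later write to it is refused until `unlock` is called.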
In other embodiments, the system of
Embodiments of the invention may be implemented using complementary metal-oxide-semiconductor (CMOS) logic circuits (“hardware”), whereas other embodiments may be implemented using a set of instructions (“software”) stored on a machine-readable medium, which when executed by a machine, cause the machine to perform operations commensurate with the various embodiments described herein. Other embodiments may be implemented using some combination of hardware and software.
While the invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments, which are apparent to persons skilled in the art to which the invention pertains are deemed to lie within the spirit and scope of the invention.