One embodiment of the invention is a computing system 100 as shown in
The computing system 300 includes four processing units 310a, 310b, 310c, 310d. Each of the processing units is coupled to a crossbar switch 320. Incorporated within, or coupled directly with, the crossbar switch 320 is a memory 330. Also residing within the switch 320 is control logic 340 and a tag bank 350, whose functionality is described below. A system architecture employing multiple processing units and a switch as shown in
The crossbar switch 320 acts as a switching system configured to allow communication, such as transference of data, between any two of the processing units 310. Further, communication between any of the processing units 310 may occur concurrently through the crossbar switch 320. Other information, such as status and control information, inter-processor messages, and the like, may be passed through the switch 320 between the processing units 310 in other implementations. In still other embodiments, other types of switches that facilitate the passing of data between the processing units 310 may be utilized. In another implementation, more than one switch 320, one or more of which contains a memory 330, may be utilized and configured to form a switching system or “fabric” inter-coupling the various processing units 310. Under this scenario, the memory 330 may be distributed among two or more of the switches forming the switching fabric or system.
The memory 330 of the crossbar switch 320 may be any memory capable of storing some portion of data passing through the switch 320 between the processing units 310. In one implementation, the storage capacity of the memory 330 is at least one gigabyte (GB). Any of a number of memory technologies may be utilized for the memory 330, including, but not limited to, dynamic random access memory (DRAM) and static random access memory (SRAM), as well as single in-line memory modules (SIMMs) and dual in-line memory modules (DIMMs) employing either DRAMs or SRAMs.
A more detailed representation of one of the processing units 310a is presented in the block diagram of
Generally, each of the processing units 310 of the particular system 300 of
After the crossbar switch 320 receives a memory request from the processing unit 310a, the switch 320 may search its memory 330 for the requested data (operation 514). If the data is stored in the memory 330, the data is accessed and returned to the requesting processing unit 310 (operation 516). If not found, the switch 320 may determine which of the remaining processing units 310 possesses the data (operation 518), such as the particular processing unit 310 acting as the home location for the requested data, and direct the request thereto (operation 520). The processing unit 310 receiving the request accesses the requested data and returns it to the switch 320 (operation 522), which in turn forwards the requested data to the requesting processing unit 310 (operation 524). In addition, the switch 320 may also store a copy of the data being returned to the requesting processing unit 310 within its memory 330 (operation 526). Any of the processing units 310 may then access the copy of the data stored within the memory 330 (operation 528).
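The request-handling flow of operations 514 through 528 may be summarized in the following minimal sketch. The class and method names (`SwitchSketch`, `UnitSketch`, `handle_request`) are illustrative assumptions, not elements of the disclosure, and the sketch models only the hit/miss and copy-back behavior described above.

```python
# Illustrative sketch of operations 514-528; all names are hypothetical.

class UnitSketch:
    """Stands in for a processing unit 310 acting as a home location."""
    def __init__(self, local):
        self.local = local  # models local memory 318: address -> data

    def read(self, address):
        # Operation 522: the home unit accesses and returns the data.
        return self.local[address]


class SwitchSketch:
    """Stands in for the crossbar switch 320 with its memory 330."""
    def __init__(self, home_map):
        self.memory = {}          # models memory 330: address -> data
        self.home_map = home_map  # address -> home processing unit

    def handle_request(self, address):
        # Operations 514/516: search the switch memory 330 first.
        if address in self.memory:
            return self.memory[address], "switch"
        # Operations 518/520: determine the home unit and direct the request.
        home = self.home_map[address]
        data = home.read(address)
        # Operation 526: store a copy for subsequent requests (operation 528).
        self.memory[address] = data
        # Operation 524: forward the data to the requesting unit.
        return data, "home"
```

Under this sketch, a first request for an address is satisfied by the home unit, while a repeated request for the same address is satisfied directly from the switch's copy.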
In the case in which the most recent version of the requested data is not located at the home processing unit 310, the home unit 310 may forward the request by way of the switch 320 to the particular processing unit 310 holding the most recent version of the requested data. In another implementation, the switch 320 may forward that request directly without involving the home unit 310. The unit 310 holding the most recent version may then return the requested data to the switch 320, which may then pass the data directly to the requesting unit 310. In a further embodiment, the switch 320 may also forward the most recent version to the home unit 310, which may then update its copy of the data.
In embodiments in which more than one switch 320 is employed within the computing system 300, more than one of the switches 320 may be involved in transferring data requests and responses between the various processing units 310. For example, upon receipt of a request for data from one of the processing units 310, one of the switches 320 may forward the request to another processing unit 310, either directly or by way of another switch 320. Data returned by a processing unit 310 in response to such a request may be returned to the requesting processing unit 310 in a similar manner. Further, one or more of the switches 320 through which the data passes may store a copy of that data for later retrieval by another processing unit 310 subsequently requesting that data.
Given that the single shared memory space is distributed among the several processing units 310, and also that each processing unit 310 may cache temporary copies of the data within its associated cache memories 314 or its local memory 318, a potential cache coherence problem may result. In other words, multiple copies of the same data, each exhibiting potentially different values, may exist. For example, if one processing unit 310 accesses data stored within the local memory 318 of another processing unit 310 through the switch 320, a question exists as to whether that data will ultimately be cached in the requesting processing unit 310, such as within one of the cache memories 314 or the local memory 318 of the processing unit 310a. Caching the data locally results in multiple copies of the data within the system 300. Saving a copy of the data within the memory 330 of the switch 320 also potentially raises the same issue.
To address possible cache coherency problems, the switch 320 may select which of the data passing through the switch 320 between the processing units 310 are stored within the memory 330. In one embodiment, such a selection may depend upon information received by the switch 320 from the processing unit 310 requesting the data. For example, the data requested may be accessed under one of two different modes: exclusive mode and shared mode. In shared mode, the requesting processing unit 310 indicates that it will not be altering the value of the data after it has been read. Conversely, requesting access to data under exclusive mode indicates that the processing unit 310 intends to alter the value of the data being requested. As a result, multiple copies of that specific data being accessed under shared mode will all have the same consistent value, while a copy of data being acquired under exclusive mode is likely to be changed, thus causing other copies of that same data to become invalid.
In one embodiment employing these two modes, the switch 320 may store data requested in shared mode in memory 330, if enough space exists within the memory 330. On the other hand, data passing through the switch 320 which is being accessed under exclusive mode will not be stored in the memory 330. Accordingly, data within the memory 330 of the switch 320 used to satisfy further data requests from one or more processing units 310 are protected from being invalidated due to alteration by another processing unit 310.
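The mode-based selection described above reduces to a small predicate, sketched below. The function name and capacity check are assumptions introduced for illustration; the disclosure specifies only that shared-mode data may be stored when space exists, and exclusive-mode data is not stored.

```python
# Hypothetical caching-decision predicate; names are illustrative.
SHARED, EXCLUSIVE = "shared", "exclusive"

def should_cache_in_switch(mode, switch_memory, capacity):
    """Cache only shared-mode data, and only while room remains in the
    switch memory (modeling memory 330)."""
    if mode == EXCLUSIVE:
        return False  # likely to be altered; a stored copy would go stale
    return len(switch_memory) < capacity

assert should_cache_in_switch(SHARED, {}, 4) is True
assert should_cache_in_switch(EXCLUSIVE, {}, 4) is False
```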
By storing at least some of the data passing through the switch 320 within the memory 330, the switch 320 may satisfy subsequent requests for that same data by reading the data directly from the memory 330 and transferring the data to the requesting processing unit 310. Otherwise, the request would be forwarded to the processing unit 310 possessing the data, after which the processing unit 310 servicing the request would read the data from its own local memory 318 and transfer the data to the switch 320, as described above. Only then would the switch 320 be capable of transferring the data to the requesting processing unit 310. Thus, in situations in which the memory 330 contains the requested data, latency between a data request and satisfaction of that request is reduced significantly. Also, overall traffic levels between the processing units 310 and the switch 320 are lessened significantly due to fewer data requests being forwarded to other processing units 310, thus enhancing the throughput and performance of the system 300.
Presuming a finite amount of data storage available in the memory 330 of the switch 320, the memory 330 is likely to become full at some point, thus requiring some determination as to which of the data stored in the memory 330 is to be replaced with new data. To address this concern in one embodiment, the switch 320 may replace the data already stored in the memory 330 under at least one cache replacement policy. For example, the switch 320 may adopt a least-recently-used (LRU) policy, in which data in the memory 330 which has been least recently accessed is replaced with the newest data to be stored into the memory 330. In another implementation, the switch 320 may utilize a not-recently-used (NRU) policy, in which data within the memory 330 which has not been accessed within a predetermined period of time is randomly selected for replacement with the new data. Other cache replacement policies, including, but not limited to, first-in-first-out (FIFO), second chance, and not-frequently-used (NFU), may be utilized in other embodiments.
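The LRU policy mentioned above may be sketched as follows, using an ordered map to track recency. The class name and fixed capacity are assumptions for illustration; the disclosure does not prescribe a particular data structure for memory 330.

```python
# Minimal LRU replacement sketch for memory 330; names are hypothetical.
from collections import OrderedDict

class LRUStore:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as recency order

    def get(self, address):
        if address not in self.entries:
            return None  # miss: the switch would forward the request
        self.entries.move_to_end(address)  # mark as most recently used
        return self.entries[address]

    def put(self, address, data):
        if address in self.entries:
            self.entries.move_to_end(address)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[address] = data
```

For example, with a capacity of two entries, storing a third entry evicts whichever of the first two was least recently accessed.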
As described in some embodiments above, the memory 330 may be implemented as a kind of cache memory. As a result, the memory 330 may be designed in a fashion similar to an external cache memory, such as a level-4 (L4) cache sometimes incorporated in central processing unit (CPU) computer boards.
In one embodiment, the switch 320 employs control logic 340 which analyzes each request for data received from the processing units 310 to determine to which of the processing units 310 the request is to be directed. This function may be performed in one example by comparing the address of the data to be accessed with a table listing addresses or address ranges of the shared address space associated with particular processing units 310. As part of this analysis, the control logic 340 may also compare the address of the requested data with a “tag bank” 350 that includes information regarding whether the data is located in the memory 330, and, if so, the location of that data within the memory 330. In one example, a non-sequential tag look-up scheme is implemented to reduce the time required to search the tag bank 350 for information regarding the requested data.
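The two look-ups performed by the control logic 340, routing a request by address range and checking the tag bank 350, can be sketched as below. The address ranges, unit identifiers, and tag-bank contents are invented for illustration; the dictionary look-up stands in for a non-sequential (hashed) tag search.

```python
# Illustrative routing table and tag bank; all values are hypothetical.
import bisect

# Sorted (base_address, unit_id) pairs partitioning the shared address space.
ranges = [(0x0000, "310a"), (0x4000, "310b"), (0x8000, "310c"), (0xC000, "310d")]

# Tag bank: address -> location (slot) of that data within memory 330.
tag_bank = {0x4100: 7}

def route(address):
    """Model of control logic 340: find the unit whose range holds the address."""
    i = bisect.bisect_right([base for base, _ in ranges], address) - 1
    return ranges[i][1]

def lookup(address):
    """Model of a non-sequential tag-bank search: hash look-up, not a scan."""
    return tag_bank.get(address)  # None indicates a memory-330 miss
```

A request for address 0x4100 would thus be served from slot 7 of memory 330, while a request for an untracked address would be routed to its home unit.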
To reduce the amount of information required in the tag bank 350, the shared memory area and, consequently, the memory 330 of the switch 320, may be organized in cache “lines,” with each line including data from multiple, contiguous address locations of the shared address space. Grouping locations of the address space in such a fashion allows a smaller tag bank 350 to be maintained and searched.
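The grouping of contiguous addresses into lines amounts to splitting each address into a line index and an offset, as sketched below. The 64-byte line size is an assumed value for illustration; the disclosure does not fix a line size.

```python
# Illustrative address decomposition; the line size is an assumption.
LINE_SIZE = 64  # bytes per cache line

def split_address(address):
    """The tag bank 350 then tracks one entry per line rather than one
    entry per byte address, shrinking the tag bank by a factor of LINE_SIZE."""
    line = address // LINE_SIZE   # index used for the tag-bank entry
    offset = address % LINE_SIZE  # byte position within the line
    return line, offset

assert split_address(0x4100) == (0x104, 0)
assert split_address(0x4107) == (0x104, 7)
```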
While several embodiments of the invention have been discussed herein, other embodiments encompassed by the scope of the invention are possible. For example, while specific embodiments of the invention described in conjunction with