The invention generally relates to computer systems and in particular relates to pipelined bus systems for use in interconnecting devices within a computer system.
It is common for computer bus systems to allow pipelining of multiple bus transactions. Pipelining occurs when subsequent transaction requests are made prior to the completion of outstanding transaction requests. A request may be issued for one bus transaction while data transfer is performed concurrently for a separate bus transaction. In this manner, multiple transactions can be processed in parallel, with different phases of the separate bus transactions performed concurrently.
Power consumption is an important issue for computer systems and integrated systems-on-a-chip (SOCs). In some applications, such as battery-powered cellphone or personal digital assistant (PDA) applications, low power consumption may be more important and desirable than high performance. Therefore, it is desirable to provide a means to select a reduction in processing capability where a corresponding reduction in power consumption is obtained. For example, it may be preferable to reduce transaction processing capability as a trade-off for a reduction in power consumption. This can be accomplished by varying the depth of the bus pipeline: a shorter bus pipeline decreases processing capability, slows the system and reduces power consumption by spreading the bus transaction requests over a longer period of time, effectively reducing the number of transfer requests per unit of time. A reduction in the number of transfer requests per unit of time results in a reduction in the power consumption of the devices receiving such requests.
It is desirable to provide a bus mechanism to allow the depth of the bus pipeline to be varied in accordance with power consumption constraints or particular processing constraints of the agents connected to the bus. One proposed solution is taught by U.S. Pat. No. 5,548,733 to Sarangdhar et al., entitled “METHOD AND APPARATUS FOR DYNAMICALLY CONTROLLING THE CURRENT MAXIMUM DEPTH OF A PIPELINED COMPUTER BUS SYSTEM.” In that approach, individual bus agents transmit signals to a central arbiter, which responsively controls the depth of the pipeline in accordance with the signals received from each of the various bus agents. Thus, one slave device effectively sets a common, lowest-common-denominator master pipeline depth for all slaves.
However, non-optimal results are obtained where not every device or agent connected to the bus is capable of accommodating the maximum depth of the bus pipeline. For example, if one particular agent is only capable of accommodating a pipeline having a depth of two, then, even though other agents can pipeline deeper, this capability is never utilized. Therefore, the effective maximum depth for a bus pipeline is constrained by the agent that can accommodate only the shallowest bus pipeline depth.
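The constraint described above can be illustrated with a minimal sketch. The agent names and depth values below are hypothetical and are not taken from the cited patent; the sketch only shows that a single shared depth equals the minimum over all agents, leaving the deeper agents' capability unused.

```python
# Hypothetical per-agent maximum pipeline depths (illustrative values only).
agent_depths = {"memory_controller": 4, "dma_slave": 4, "uart_bridge": 2}

def common_depth(depths):
    """Shared depth under a single-depth scheme: the shallowest agent wins."""
    return min(depths.values())

shared = common_depth(agent_depths)

# Pipeline capability each agent has but can never use under the shared depth.
unused = {name: depth - shared for name, depth in agent_depths.items()}
```

With these assumed values, the shared depth is two and each of the four-deep agents leaves two pipeline slots unused.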
Moreover, the bus depth to which any particular agent is capable of accommodating may depend on the particular hardware of the bus agent or it may depend upon the current state of the bus agent. For example, during start-up or initialization of an agent, the agent may not be capable of processing any bus transactions, whereas once the agent has been fully activated, the agent may be capable of accommodating a bus pipeline depth of five. In other cases, the maximum depth to which an agent can process bus requests is a dynamic function of the current state of input and output queues connected to the bus. For example, if the input queue of a bus agent is empty, the agent may be capable of accommodating a pipeline depth of five. However, if the queue of the bus agent is full or nearly full, the agent may only be capable of accommodating a bus pipeline depth of one or two.
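One way such state-dependent capability might be modeled is sketched below. The function name, the queue capacity of eight entries and the policy of advertising one pipeline slot per free queue entry are all assumptions made for illustration; an actual agent could derive its depth differently.

```python
def advertised_depth(queue_capacity, queue_occupancy, max_depth=5, ready=True):
    """Depth an agent could advertise, given its input queue state (assumed policy)."""
    if not ready:
        # During start-up or initialization the agent accepts no transactions.
        return 0
    free_entries = queue_capacity - queue_occupancy
    # Never advertise more than the hardware maximum or the free queue space.
    return max(0, min(max_depth, free_entries))

empty_queue_depth = advertised_depth(queue_capacity=8, queue_occupancy=0)
nearly_full_depth = advertised_depth(queue_capacity=8, queue_occupancy=7)
startup_depth = advertised_depth(queue_capacity=8, queue_occupancy=0, ready=False)
```

Under these assumptions an empty queue yields a depth of five, a nearly full queue yields a depth of one, and an agent that is still initializing yields a depth of zero.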
What is needed is a system and method that allows a request pipeline depth setting such that not all masters and cache snooping devices in a system are bound to the least common denominator of all receiving devices. What is also needed is a dynamic adjustment means of pipeline depth setting, wherein the pipeline depth setting can be increased or decreased responsive to the capabilities of the system devices, processing capability and power consumption requirements.
A method and system for a pipelined bus interface macro for use in interconnecting devices within a computer system. The system and method utilize a pipeline depth signal that indicates a number N of discrete transfer requests that may be sent by a sending device and received by a receiving device prior to acknowledgement of receipt of the transfer requests by the receiving device. The pipeline depth signal may be dynamically modified, enabling a receiving device to decrement or increment the pipeline depth while one or more unacknowledged requests have been made. The dynamic modifications may occur responsive to many factors, such as an instantaneous reduction in system power consumption, a bus interface performance indicator, a receiving device performance indicator or a system performance indicator.
Referring now to
The system 10 architecture supports read and write transfers between all master devices 18 initiating requests and all slave devices 14 receiving requests (the master/slave interface). It also supports snoop requests broadcast to the snooping master devices 16 and acknowledgement sent back with the results from each. The snooping master devices 16 interface the snoop bus 19 with their cache components 17 to provide cache management functions (the snoop interface). Each master device 18 is attached to the PLB macro 12 through separate address 182, control 184, read data 186, and write data 188 buses. Each slave device 14 is attached to the PLB macro 12 via address 142, control 144, read data 146, and write data 148 buses. A slave 14 may or may not share this set of signals with another slave 14. The separate master read 186 and write 188 data buses allow master devices 18 to simultaneously receive read data from the PLB macro 12 and provide write data to the PLB macro 12. The separate slave read 146 and write 148 data buses allow slave devices 14 to simultaneously provide read data to the PLB macro 12 and accept write data from the PLB macro 12.
The PLB macro 12 bus structure can be implemented in a wide variety of configurations. The basic architecture is a “store and forward” design where all master request information is registered and subsequently broadcast to a slave 14. Read and write data is also registered in the PLB macro 12 between the master 18 and slave 14. The PLB macro 12 can be implemented as a simple shared bus where master devices 18 arbitrate for control of a single bus to which all slave devices 14 are attached. The bus structure can also be built as a crossbar design where multiple slave buses exist and master devices 18 arbitrate for these buses independently. The slave buses are differentiated by address mapping which is performed inside the PLB macro 12. Slaves 14 on each way decode addresses broadcast on their way to determine if they are the device being selected for data transfer. A crossbar allows dataflow between more than one slave bus segment at a time. This parallelism enables simultaneous data transfer in the same direction between different master and slave devices. Crossbar structures attaching a single slave to one slave bus segment and multiple slaves attached to another segment can be implemented. This is referred to as a hybrid crossbar. This allows a single design to be used in multiple SOCs with slightly different combinations of devices attached. Slave devices which do not require the highest performance, or frequent access, will typically be attached to a shared bus segment when a single segment is not available for each slave.
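The address mapping that differentiates slave bus segments can be sketched as follows. The address ranges and segment names are hypothetical; the sketch assumes only that each address maps to exactly one segment and that two transfers can proceed at the same time when they target different segments.

```python
# Hypothetical address map: (first address, last address, slave bus segment).
SEGMENT_MAP = [
    (0x0000_0000, 0x7FFF_FFFF, "segment_0"),
    (0x8000_0000, 0xFFFF_FFFF, "segment_1"),
]

def select_segment(address):
    """Return the slave bus segment whose address range contains the address."""
    for first, last, segment in SEGMENT_MAP:
        if first <= address <= last:
            return segment
    raise ValueError(f"address {address:#x} is not mapped to any segment")

def can_run_in_parallel(address_a, address_b):
    """Two transfers can overlap only if they target different segments."""
    return select_segment(address_a) != select_segment(address_b)
```

In a simple shared-bus configuration the map would contain a single segment, so `can_run_in_parallel` would always return `False`.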
The system 10 architecture supports multiple processor cores 11 attached to the PLB bus macro 12 as autonomous master units. The architecture includes hardware enforced cache coherency. The PLB bus macro 12 supports shared coherent memory between multiple processors 11 and other caching coherent snooping master devices (“snoopers”) 16. This is achieved by a dedicated address snooping interface between the PLB bus macro 12 and snooping devices 16. “Snooping” is defined as the task of a caching coherent snooping master device 16 to monitor all memory accesses to coherent memory to ensure that the most recent version of memory data is presented to a requesting master. Cache snooping devices 16 are also described in a related U.S. patent application to Dieffenderfer et al., entitled “TARGETED SNOOPING” filed on Mar. 20, 2003 Ser. No. 10/393,116, the entire disclosure of which is hereby incorporated into this application.
The PLB bus macro 12 divides coherent memory into 32-byte coherence granules. This allows the PLB bus macro 12 to perform snoop operations on multiple coherence granules during a single operation, or tenure. This is referred to as the “snoop burst protocol.” The snoop burst protocol allows processors 11, which are typically operating at a higher frequency than the PLB bus macro 12, to pipeline snoops into their cache hierarchy, ultimately producing better performance for both the processor 11 and the PLB bus macro 12.
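As a minimal sketch of the granule arithmetic, the function below lists the 32-byte coherence granules covered by a transfer. The function name and the example address are hypothetical; only the 32-byte granule size comes from the description above.

```python
GRANULE_BYTES = 32  # coherence granule size stated in the description

def granules_touched(address, length):
    """Return the base address of every coherence granule a transfer covers."""
    if length <= 0:
        return []
    first = address // GRANULE_BYTES
    last = (address + length - 1) // GRANULE_BYTES
    return [index * GRANULE_BYTES for index in range(first, last + 1)]

# A 64-byte transfer that starts in the middle of a granule spans three granules.
bases = granules_touched(0x1010, 64)
```

A transfer covering several granules in this way is the kind of case in which one snoop tenure could address multiple granules under the snoop burst protocol.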
To maximize propagation time and provide a deterministic timing budget between the various agents attached (for example cores 11, memory controllers 13, slaves 14, etc.), the PLB bus macro 12 implements a fully synchronous “register-to-register pulse protocol.”
Referring now to
The system 10 architecture uses a “request and acknowledge” protocol for master devices 18 to communicate with the PLB macro 12, for the PLB macro 12 to communicate with slave devices 14, and for snoop requests from the PLB macro 12 to communicate with the snoop devices 16. An information transfer begins with a request and is confirmed with an acknowledge. Master 18 requests are made to the bus macro 12 and some time later these requests are broadcast to the slave 14, and to cache snooping devices 16 if a cache snooping operation is requested. Two types of information requests may be performed.
Thus, the IAP mode mechanism has the effect of eliminating latency associated with making subsequent requests. What is important in the present invention is that a means is provided within all request/acknowledge interfaces such that the device that is making transfer requests knows how many unacknowledged requests the receiving device can accept (“N”) prior to an overflow situation. This limits the requesting device from making an N+1 request. In a preferred embodiment, this is accomplished by “request pipeline depth” signals asserted by the receiving device and sampled by the requesting device.
The PLB bus macro 12 architecture also supports Intra-Address pipelined (IAP) snoop tenures. As described above, the broadcasting of multiple requests between devices prior to a response signal from the destination is allowed. This approach applied to the snoop bus interface allows multiple snoop tenures to be broadcast to the snooping devices 16 before receiving any response signals for preceding snoop request signals. This coupled with the snoop burst protocol allows very efficient communication of snoop information between the PLB bus macro 12 and the snooping devices 16.
What is also important is that the present invention enables a receiving device to dynamically decrement, and subsequently increment, the request pipeline depth while one or more unacknowledged requests have been made. This may occur for several reasons; for example, responsive to a request for a reduction in instantaneous power consumption. The total number of requests made is not reduced, only how many are issued per unit time, creating a lower-power quiescent mode of operation. This can be advantageous from a system-level perspective when it is recognized that the initiation of each request increases the power consumption of the PLB bus macro 12 and of all devices receiving such requests, along with their associated processing activities based on these requests.
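The effect of pipeline depth on request rate can be estimated with a short sketch. It assumes a fixed request tenure length and applies the general queueing relation that the number of outstanding requests equals the request rate multiplied by the tenure length; the eight-cycle tenure is an illustrative value, not a figure from this description.

```python
def max_request_rate(pipeline_depth, tenure_cycles):
    """Upper bound on requests issued per clock cycle at a given depth.

    Assumes every request tenure lasts tenure_cycles, so at most
    pipeline_depth requests can start within any window of that length.
    """
    if tenure_cycles <= 0:
        raise ValueError("tenure_cycles must be positive")
    return pipeline_depth / tenure_cycles

four_deep_rate = max_request_rate(pipeline_depth=4, tenure_cycles=8)
one_deep_rate = max_request_rate(pipeline_depth=1, tenure_cycles=8)
```

Under these assumptions, reducing the depth from four to one lowers the peak request rate by a factor of four while the same total number of requests is eventually issued.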
The request pipeline depth signals are sampled by the “sending device” (from master 18 to PLB macro 12, or from PLB macro 12 to slave 14 or snooper 16) prior to asserting the request signal. They are asserted by the “receiving device” to indicate to the “sending device” how many requests can be pipelined to it prior to completion of the first, or oldest, “request tenure.” The request tenure may be defined as the period from the assertion of a request signal until the sampling of a corresponding response signal by a requesting device. The sending device must sample these pipeline depth signals prior to a request and, once the depth is reached, must not assert its request signal until completion of the earliest preceding request tenure. The value of a request pipeline depth signal can both increase and decrease while an uncompleted request tenure exists between the “sending device” and the “receiving device.” Two cases exist. In the first case, the “receiving device” increases the value of the request pipeline depth signals to the “sending device.” Once the “sending device” detects this, it is allowed to immediately pipeline requests to the new depth. In the second case, the “receiving device” decreases the value of the request pipeline depth signals to the “sending device.” Following the decrease of the pipeline depth value while the “sending device” has an outstanding uncompleted request tenure, the “sending device” assumes the request pipeline depth is reduced by a depth of only one with each request tenure completion subsequent to the decrementation of the pipeline depth. This decrementation mechanism could be utilized by the “receiving device” to reduce instantaneous power consumption by “spreading” pipelined requests out over a larger time period.
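The two cases above can be sketched as a small model of the sending device. The class and method names are hypothetical, and the sketch assumes that a decrease observed while no request is outstanding takes effect at once; it is an illustration of the rule as described, not the actual hardware logic.

```python
class RequestSender:
    """Illustrative model of a sending device tracking its effective depth."""

    def __init__(self):
        self.effective_depth = 0  # depth the sender currently honors
        self.outstanding = 0      # request tenures not yet completed

    def sample_depth(self, advertised_depth):
        # Increases apply immediately. A decrease applies immediately only
        # when nothing is outstanding (an assumption of this sketch).
        if advertised_depth >= self.effective_depth or self.outstanding == 0:
            self.effective_depth = advertised_depth

    def can_request(self):
        return self.outstanding < self.effective_depth

    def issue_request(self):
        if not self.can_request():
            raise RuntimeError("request pipeline depth reached")
        self.outstanding += 1

    def complete_tenure(self, advertised_depth):
        if self.outstanding == 0:
            raise RuntimeError("no outstanding request tenure")
        self.outstanding -= 1
        # A pending decrease is absorbed one step per completed tenure.
        if advertised_depth < self.effective_depth:
            self.effective_depth -= 1

sender = RequestSender()
sender.sample_depth(4)
for _ in range(4):
    sender.issue_request()
sender.sample_depth(1)  # receiver drops to one-deep; effective depth stays 4

trace = []
for _ in range(4):
    sender.complete_tenure(1)
    trace.append((sender.effective_depth, sender.outstanding, sender.can_request()))
```

In this run the effective depth steps from four down to one over the first three completions, and a new request becomes possible only after the fourth completion empties the pipeline.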
Subsequent to the first request 70, the pipeline depth indication 76 immediately decrements all the way down to one-deep during Cycle 2, as indicated by the binary value of “00.” The master device 18 does not recognize this decrementation until Cycle 8, when the master 18 has sampled the assertion of the PLB_M0addrAck signal 106 completing the oldest outstanding request tenure, whereupon the master device Effective ReqPipe value 78 is accordingly decremented. As the master 18 continues to sample the PLB_M0addrAck signal 106, the Effective ReqPipe value 78 is decremented down to one-deep over the three Cycles 8, 9 and 10. In Cycle 10, the master 18 is aware that the pipeline depth is one-deep. At the rising edge 22 of clock signal 20 in Cycle 11, the master 18 samples the last PLB_M0addrAck signal 110, which indicates that the pipeline is now empty (there are no outstanding transfer requests). The master 18 can now make the fifth request 74 because the pipeline is empty and the pipeline depth 76 indicates that a pipeline depth of one is available. At the rising edge 22 of clock signal 20 in Cycle 12, the master 18 then asserts the last request 74.
Master transfer attributes (read or write, byte enables and requested size of transfer) are also indicated in the timing diagram shown in
In Cycle 1, the master 18 initiates a first read request 70 to the PLB macro 12. The request 70 gets registered by the PLB bus macro 12, and the PLB bus macro 12 arbitrates the request 70 during Cycle 2 and initiates a slave request 92 at the rising edge 22 of clock signal 20 in Cycle 3 in a “store and forward” type of structure to the appropriate slave device 14. During the arbitration process in Cycle 2, the PLB macro 12 determines that the slave request pipeline depth value 90 is set to four-deep (binary “11”). Accordingly, the PLB macro 12 immediately transmits the series of four requests 92–95 generated by the master 18 to the slave device 14.
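The description gives two points of the depth encoding: binary “00” for one-deep and binary “11” for four-deep. A minimal sketch of a decoder consistent with those two points follows. The function names are hypothetical, and the intermediate codes (“01” for two-deep, “10” for three-deep) are an assumption inferred from the endpoints.

```python
def decode_req_pipe(bits):
    """Convert a two-bit depth code such as "11" into a pipeline depth."""
    if len(bits) != 2 or any(bit not in "01" for bit in bits):
        raise ValueError("expected a two-character binary string")
    return int(bits, 2) + 1  # assumed encoding: depth is the code value plus one

def encode_req_pipe(depth):
    """Convert a pipeline depth of 1 to 4 into its two-bit code."""
    if not 1 <= depth <= 4:
        raise ValueError("depth must be between 1 and 4")
    return format(depth - 1, "02b")
```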
In the example illustrated in
As described earlier in the specification, due to the register-to-register protocol of the present system 10, the transfer requests 92–95 initiated in Cycle 3 do not begin to be acknowledged 100 by the slave until Cycle 5. At the rising edge 22 of clock signal 20 in Cycle 3, the PLB macro 12 tells the slave 14 that it wants to request a data transfer by asserting the request signal 92. At the rising edge 22 of clock signal 20 in Cycle 4, the slave 14 captures the request 92 and, during Cycle 4, the slave 14 decodes a response. Then at the rising edge 22 of clock signal 20 in Cycle 5, the slave 14 asserts an acknowledge signal 100.
The PLB macro proceed signal 104 value is related to the cache snooping results of the present system 10. When the proceed signal 104 is asserted, then the current transaction will not be terminated and the slave 14 goes forward with providing the read data.
Returning again to the slave acknowledgment signal 100, it is asserted at the rising edge 22 of clock signal 20 in Cycle 5 by the slave 14 to communicate to the PLB macro 12 that the slave 14 has acknowledged and received bus request 92, which is the master request 70 through the PLB macro 12. After a latent Cycle 6 created by the register-to-register protocol architecture, the PLB macro 12 processes the slave acknowledgment signal 100 and asserts the master acknowledgement signal “PLB_M0addrAck” 106 in Cycle 7. Responsive to each of the acknowledged request transactions 107–110, the effective request pipeline depth value 78 is decremented by one, so that at the rising edge 22 of clock signal 20 in Cycle 10 the effective request pipeline depth is now one-deep. Accordingly, at Cycle 12 the master device 18 now asserts the last request 74 of the five desired requests 70–74.
Master request 74 is then captured by the PLB macro 12 and is forwarded on to the slave as PLB request 120 at the rising edge 22 of clock signal 20 in Cycle 14. In a similar fashion to the preceding transfers described above, the slave captures the request 120, decodes it and responds with an address acknowledgment 122 at the rising edge 22 of clock signal 20 in Cycle 16. The PLB macro 12 asserts the PLB_M0addrAck signal 106 at the rising edge 22 of clock signal 20 in Cycle 18. Several clocks later, at the rising edge 22 of clock signal 20 in Cycle 21, the slave 14 returns the read data to the PLB macro 12, as indicated by the assertion of the S10_rdDAck signal 105. The PLB macro 12 captures the read data and then forwards the data on to the master 18 at the rising edge 22 of clock signal 20 in Cycle 23. Thus, the master 18 decrements the effective pipeline depth by one for each request acknowledgement 107–110 that it receives responsive to its requests 70–74 until it recognizes that it can go no lower (i.e., it has reached “00,” for a maximum depth of one request at a time), and it remains at that depth until the PLB macro 12 indicates to the master 18 that the pipeline depth has been increased.
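The cycle numbers in this walkthrough follow a regular pattern: each hop through a registered interface takes two clock cycles. The sketch below reproduces the address-tenure timing for the first and fifth requests. The function name and the constant `HOP_CYCLES` are hypothetical, and the two-cycle hop is inferred from the cycle numbers quoted above rather than stated as a rule.

```python
HOP_CYCLES = 2  # inferred from the example: assert in one cycle, respond two later

def address_tenure(master_request_cycle):
    """Cycle at which each address-phase event occurs for one request."""
    slave_request = master_request_cycle + HOP_CYCLES  # PLB forwards the request
    slave_ack = slave_request + HOP_CYCLES             # slave asserts acknowledge
    master_ack = slave_ack + HOP_CYCLES                # PLB asserts PLB_M0addrAck
    return {
        "master_request": master_request_cycle,
        "slave_request": slave_request,
        "slave_ack": slave_ack,
        "master_ack": master_ack,
        "master_samples_ack": master_ack + 1,          # sampled on the next edge
    }

first_request = address_tenure(1)   # request 70
fifth_request = address_tenure(12)  # request 74
```

For the first request this gives Cycles 1, 3, 5 and 7, with the acknowledge sampled in Cycle 8; for the fifth request it gives Cycles 12, 14, 16 and 18, matching the cycles described above.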
As illustrated, the PLB bus macro 12 can dynamically make instantaneous changes in pipeline depth settings by increasing or decreasing the pipeline depth settings responsive to outstanding transactions. The pipeline depth indicator PLB_M0ReqPipe 76 value may also be dynamically increased or decreased responsive to the performance and power consumption of the PLB macro 12, the master 18 and slave devices 14 (the master/slave interface), the processor 11 cache components 17 and the snooping master devices 16 (the snoop interface).
One advantage of the present invention is that there is no requirement that the requesting device and receiving device be in an “idle” state when the pipeline depth is changed. Also, this invention indicates the maximum pipeline depth immediately thus allowing high throughput intra-address pipelining without sampling the pipeline depth or receiving transfer credit signals prior to each individual request. No clocks must be inserted between depth change indications. Another important advantage of the present invention is that each interface between each master, slave, cache snooper and PLB bus macro can be set to any desired pipeline depth independently of any other master, slave, cache snooper and PLB bus interface. Since the pipeline depth of each interface can be independently adjusted, the present invention provides the best mix and highest granularity of system performance and power savings management.
The snoop interface can also be controlled in order to modulate the cache performance and power consumption of the caches.
While preferred embodiments of the invention have been described herein, variations in the design may be made, and such variations may be apparent to those skilled in the art of computer system design, as well as to those skilled in other arts. The computer system and device components identified above are by no means the only system and devices suitable for practicing the present invention, and substitute system architectures and device components will be readily apparent to one skilled in the art. The scope of the invention, therefore, is only to be limited by the following claims.
Number | Name | Date | Kind |
---|---|---|---|
5548733 | Sarangdhar et al. | Aug 1996 | A |
5699516 | Sapir et al. | Dec 1997 | A |
5784579 | Pawlowski et al. | Jul 1998 | A |
5881277 | Bondi et al. | Mar 1999 | A |
5948088 | Sarangdhar et al. | Sep 1999 | A |
6009477 | Sarangdhar et al. | Dec 1999 | A |
6076125 | Anand | Jun 2000 | A |
6081860 | Bridges et al. | Jun 2000 | A |
6154800 | Anand | Nov 2000 | A |
6266750 | DeMone et al. | Jul 2001 | B1 |
6317803 | Rasmussen et al. | Nov 2001 | B1 |
6317806 | Audityan et al. | Nov 2001 | B1 |
6460119 | Bachand et al. | Oct 2002 | B1 |
6578116 | Bachand et al. | Jun 2003 | B2 |
6633972 | Konrad | Oct 2003 | B2 |
6665756 | Abramson et al. | Dec 2003 | B2 |
6668309 | Bachand et al. | Dec 2003 | B2 |
6772254 | Hofmann et al. | Aug 2004 | B2 |
6839833 | Hartnett et al. | Jan 2005 | B1 |
6848029 | Coldewey | Jan 2005 | B2 |
20020010831 | DeMone et al. | Jan 2002 | A1 |
20020053038 | Buyuktosunoglu et al. | May 2002 | A1 |
20020078269 | Agarwala et al. | Jun 2002 | A1 |
20020116449 | Modelski et al. | Aug 2002 | A1 |
20020120796 | Robertson | Aug 2002 | A1 |
20020120828 | Modelski et al. | Aug 2002 | A1 |
20030131164 | Abramson et al. | Jul 2003 | A1 |
20030140216 | Stark et al. | Jul 2003 | A1 |
20040010652 | Adams et al. | Jan 2004 | A1 |
20040064662 | Syed et al. | Apr 2004 | A1 |
Number | Date | Country |
---|---|---|
20040236888 A1 | Nov 2004 | US |