One or more embodiments of the invention relate generally to the field of integrated circuit and computer system design. More particularly, one or more embodiments of the invention relate to a method and apparatus for open loop buffer allocation to sustain read streaming with a minimal read buffer size.
Communications between devices that make up an electronic system are typically performed using one or more busses that interconnect such devices. These busses may be dedicated busses coupling only two devices, or they may be used to connect more than two devices. The busses may be formed entirely on a single integrated circuit die, thus being able to connect two or more devices on the same chip. Alternatively, a bus may be formed on a separate substrate from the devices, such as on a printed wiring board.
As the operating frequency and speed of certain devices have increased, the rate at which such devices can supply data may exceed the maximum data rate of slower devices. In other words, based on the operating frequency and speed of a source device, the rate of data bandwidth from a fast source device may exceed the rate of data bandwidth that can be successfully handled by a slow target device. Accordingly, buffer overflow may occur when a fast source device is writing to a slow target device.
One traditional technique for avoiding buffer overflow between fast source and slow target devices is a closed loop allocation scheme. Closed loop allocation uses feedback regarding remaining buffer space to avoid buffer overflow. Closed loop allocation also requires a deeper read buffer to ensure streaming of read data. Unfortunately, the deeper buffer size results in an increased gate count, increased die size and, ultimately, higher costs. However, as a result of budgetary conditions, limitations on gate count and die size are generally imposed on product manufacturers.
Accordingly, conventional buffering of data, when writing from a fast source device to a slow target device, is generally performed according to a closed loop scheme, using feedback about available space in the read buffer to determine when to launch additional data requests. Hence, a request is not launched to memory if there is no corresponding space available in the buffer. However, if die size is limited, closed loop allocation schemes will lead to performance degradation within high performance hardware configurations.
The various embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
A method and apparatus for an open loop buffer allocation are described. In one embodiment, the method includes loading requested data within a buffer according to a load rate. Concurrent with the loading of data within the buffer, the data is forwarded (drained) from the buffer according to a drain rate. In situations where the load rate exceeds the drain rate, read requests may be throttled during detected buffer capacity conditions according to an approximate buffer capacity level. In one embodiment, a rate for issuing data requests, for example, to memory, is regulated according to a predetermined buffer accumulation rate. Accordingly, in one embodiment, the open loop allocation scheme reduces latency while enabling sustained read streaming with a minimal size read buffer.
System Architecture
Representatively, chipset 200 may include graphics block 110, such as, for example, a graphics engine or chipset, and may be coupled to hard drive devices (HDD) 130 and main memory 120. In one embodiment, chipset 200 includes a memory controller and/or an input/output (I/O) controller. In an alternate embodiment, chipset 200 may operate as or include a system controller. In one embodiment, memory 120 is a multiple channel memory, such as a dual channel memory, and may include, but is not limited to, random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), double data rate (DDR) SDRAM (DDR-SDRAM), Rambus DRAM (RDRAM) or any device capable of supporting high-speed buffering of data.
Representatively, graphics block 110 may be configured as an integrated graphics chipset, including a graphics accelerator. The graphics accelerator may include an instruction processing unit to control the graphics engine. As illustrated, chipset 200 provides graphics engine 110 with data from memory channels 120. In one embodiment, graphics engine 110 requires high data bandwidth, as determined by a burst group length supported by graphics engine 110. As a result, the performance of graphics engine 110 is directly related to the amount of available bandwidth from memory 120.
As further illustrated, a plurality of I/O devices 140 (140-1, . . . , 140-N) may be coupled to chipset 200 via bus 150. As described above, each device that resides on a bus (e.g., I/O, memory, graphics, FSB or other bus) is referred to as a bus agent. In one embodiment, each bus agent arbitrates for bus ownership by asserting a bus request signal. In one embodiment, computer system 100 may be configured according to a three-bus system, including, but not limited to, an address bus, a data bus and a transaction bus. Accordingly, a bus agent issues an address bus request signal (ABR), a data bus request signal (DBR) or a transaction bus request (TBR) signal to request bus ownership to issue bus transactions.
A bus transaction can exhibit several bus protocol events. These include an arbitration event to determine bus ownership between competing bus agents. Thereafter, the transaction enters the request phase, where the bus owner drives transaction address information. Accordingly, when the request phase includes a data request, the bus agent requesting data may be referred to herein as an “initiator bus agent”. Following transaction initiation, a data phase results in a bus agent providing the requested data to the initiator bus agent. As described herein, the bus agent from which data is requested is referred to as a “completer bus agent”. As further described herein, the completer bus agent may be referred to as a “master bus agent”, whereas the initiator bus agent may be referred to as a “target bus agent”.
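For illustration, the transaction phases and agent roles described above may be modeled with the following minimal Python sketch (the names Phase and Transaction are hypothetical and are not part of the embodiments):

    from enum import Enum, auto
    from dataclasses import dataclass

    class Phase(Enum):
        ARBITRATION = auto()  # competing bus agents resolve bus ownership
        REQUEST = auto()      # the bus owner drives transaction address information
        DATA = auto()         # the completer bus agent returns the requested data

    @dataclass
    class Transaction:
        initiator: str        # agent requesting data (the "target bus agent")
        completer: str        # agent supplying data (the "master bus agent")
        phase: Phase = Phase.ARBITRATION

        def advance(self) -> None:
            order = [Phase.ARBITRATION, Phase.REQUEST, Phase.DATA]
            index = order.index(self.phase)
            if index < len(order) - 1:
                self.phase = order[index + 1]

    # Example: a read issued by the graphics engine and completed by memory.
    txn = Transaction(initiator="graphics engine", completer="memory")
    txn.advance()  # REQUEST phase: the initiator drives the address
    txn.advance()  # DATA phase: the completer returns data to the initiator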
Accordingly, computer systems, such as computer system 100, generally utilize shared bus architectures to provide communication among devices. Devices, such as processors, memory controllers, I/O controllers and direct memory access (DMA) units are usually connected via a shared bus. In general, only one device can drive the bus at a given time. Hence, it is necessary to arbitrate between devices requesting bus ownership to prevent multiple devices from driving the bus simultaneously.
Within computer system 100, the rate at which a master bus agent (e.g., memory 120) can supply data may exceed the maximum bandwidth supported by a target bus agent (e.g., graphics engine 110) in high performance system configurations. As a result, buffering of such data prior to forwarding of the data to the target bus agent may lead to buffer overflow. Conventional techniques for averting buffer overflow include closed loop allocation schemes, which use feedback about remaining space in a read buffer, and generally require a deeper sized buffer to ensure streaming of read data. However, when gate count budgets and die size are restricted, such budgetary concerns prohibit the use of conventional closed loop allocation schemes.
Accordingly, in one embodiment, buffer logic 210 performs open loop buffer allocation.
In one embodiment, approximation of the buffer capacity level of buffer 280 without feedback information begins by analyzing system configuration parameters. For example, in one embodiment, the memory clock frequency of memory 120 is 166 megahertz (MHz), resulting in a memory clock period of 6 nanoseconds (ns). In the embodiment illustrated, memory 120 is configured as a dual channel DDR memory. In turn, the graphics clock frequency is equal to 266 MHz, resulting in a graphics clock period of 3.75 ns. As further illustrated, dual channel memory 120 enables the reading of a hex word (HW), defined as 256 bits (32 bytes) of data, during each memory clock period.
In turn, graphics engine 110 is able to support the forwarding of an octal word (OW), defined as 128 bits (16 bytes) of data, during each graphics clock period. Representatively, in this configuration, the load rate of data into read buffer 280 is 1 HW of data every memory clock (or 256 bits every 6 ns), for an effective load rate of 5.33 gigabytes per second (GB/s). Conversely, the effective drain rate of data from read buffer 280 to graphics engine 110 is 1 OW of data every graphics clock (or 128 bits every 3.75 ns), for an effective drain rate of 4.27 GB/s.
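These figures follow directly from the clock periods; as a rough check, the following minimal Python sketch (illustrative only; the variable names are not part of the embodiments) recomputes them:

    # Load side: one hex word (HW) = 256 bits = 32 bytes per 6 ns memory clock.
    HW_BYTES = 32
    MEMORY_CLOCK_NS = 6.0
    load_rate_gb_per_s = HW_BYTES / MEMORY_CLOCK_NS        # 5.33 GB/s

    # Drain side: one octal word (OW) = 128 bits = 16 bytes per 3.75 ns graphics clock.
    OW_BYTES = 16
    GRAPHICS_CLOCK_NS = 3.75
    drain_rate_gb_per_s = OW_BYTES / GRAPHICS_CLOCK_NS     # 4.27 GB/s

    # Bytes per nanosecond equal gigabytes per second.
    ratio = load_rate_gb_per_s / drain_rate_gb_per_s       # 1.25, i.e. a 5:4 ratio
    print(load_rate_gb_per_s, drain_rate_gb_per_s, ratio)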
Hence, the load-to-drain rate ratio is 1.25 (i.e., a 5:4 load-to-drain ratio) in an equal elapsed time interval. Accordingly, based on a predetermined load-to-drain rate ratio, in one embodiment, a load constant is set to a value equal to the load rate. In one embodiment, the load constant is used to program load/drain timer 262. In one embodiment, timer 262 counts down to a value of zero as long as a read request is acknowledged or the accumulation counter indicates outstanding data. Once timer 262 expires, the programmed load constant is reloaded and the countdown continues as long as there is further committed data to process.
In one embodiment, counter increment logic 260 includes load/drain timer 262. Representatively, once load/drain timer 262 expires, accumulation counter 250 is incremented. In one embodiment, accumulation counter 250 represents an approximate buffer accumulation depth. In one embodiment, accumulation counter 250 is initialized to zero and incremented in units of HW by the amount of read data committed to the read buffer (32 bytes every load clock). Conversely, accumulation counter 250 is decremented in units of HW by the amount of read data that has been drained within one drain-to-load ratio period.
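The interaction of load/drain timer 262 and accumulation counter 250 may be sketched behaviorally as follows (a minimal Python sketch under the sample parameters above; the class and method names are hypothetical, and the actual logic is implemented in hardware):

    class OpenLoopAccumulator:
        """Approximates the read buffer level without feedback from the buffer."""

        def __init__(self, load_constant: int):
            # For a 5:4 load-to-drain ratio, a load constant of 5 means one
            # net HW accumulates every 5 load clocks.
            self.load_constant = load_constant
            self.timer = load_constant   # load/drain timer, counts down to zero
            self.depth = 0               # accumulation counter, in HW units

        def tick(self, committed_data: bool) -> None:
            """Called once per load clock while read data is outstanding."""
            if not committed_data:
                return
            self.timer -= 1
            if self.timer == 0:
                # Timer expiry: loading has outpaced draining by one HW over
                # the interval, so the approximate buffer depth grows by one.
                self.depth += 1
                self.timer = self.load_constant  # reload; countdown continues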
In a further embodiment, a constant value is used to determine the minimum number of buffer slots required to prevent buffer overflow. Accordingly, the minimum buffer slots value is a measure of how close buffer 280 is to becoming full. In determining the minimum buffer slots value, an extra margin of safety is provided to account for system boundary conditions. As further illustrated in Table 1, due to the discrepancy between the load clock domain and the drain clock domain, a clock crossing penalty from the load clock domain to the drain clock domain is calculated to determine the minimum buffer slots value.
For example, as illustrated in Table 1, it takes six drain clocks of elapsed time from loading the first 32 bytes of data into buffer 280 in the memory clock domain to completion of draining the first 32 bytes of data from buffer 280 in the graphics clock domain. In other words, starting from an empty read buffer 280, during these first six drain clocks there is no concurrent load and drain of data to graphics engine 110. After this initial period, load and drain happen concurrently at steady state with the deterministic load-to-drain ratio. In one embodiment, this initial period determines the minimum buffer slots value that must be reserved so that the crossing penalty is not visible during steady state operation.
Accordingly, based on the sample system parameters above, six drain clocks equate to four load clocks. In one embodiment, this value of four load clocks equates to four buffer slots of reserved storage for the load-to-drain crossing penalty of Table 1 and serves as a baseline for selecting the buffer full constant value. In one embodiment, the approximate buffer level is measured by accumulation counter 250, which is incremented each time load/drain timer 262 expires. In one embodiment, buffer 280 may have a buffer depth of eight slots (256 bits each). Hence, the buffer full constant value may be set to four. Accordingly, in one embodiment, a buffer capacity condition is detected when accumulation counter 250 is equal to the buffer full constant value.
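The arithmetic behind these constants may be illustrated as follows (a sketch using the sample parameters above; the rounding step and variable names are assumptions for illustration):

    import math

    GRAPHICS_CLOCK_NS = 3.75
    MEMORY_CLOCK_NS = 6.0

    # Six drain clocks elapse before the first 32 bytes loaded are fully drained.
    crossing_penalty_ns = 6 * GRAPHICS_CLOCK_NS                        # 22.5 ns
    reserved_slots = math.ceil(crossing_penalty_ns / MEMORY_CLOCK_NS)  # 4 load clocks

    BUFFER_DEPTH_SLOTS = 8                                      # slots of 256 bits each
    buffer_full_constant = BUFFER_DEPTH_SLOTS - reserved_slots  # 4

    def capacity_condition(accumulation_counter: int) -> bool:
        # A buffer capacity condition is detected when the accumulation
        # counter reaches the buffer full constant value.
        return accumulation_counter >= buffer_full_constant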
In one embodiment, detection of a buffer capacity condition causes command controller 220 to throttle issuance of read requests to, for example, memory 120. Representatively, rest timer logic 240 may be programmed according to a predetermined rest delay to increase the number of free buffer slots in buffer 280 and thereby avoid buffer overflow. Accordingly, computer system 100 is able to sustain the continuous read streaming required by, for example, graphics engine 110, while avoiding frequent start/stop data streaming behavior, thereby minimizing arbitration penalties resulting from unavailability of data.
Representatively, full flag 360 is asserted when accumulation counter signal 330 reaches a preprogrammed value, such as the buffer full constant value. However, as described herein, the terms “assert”, “asserting”, “asserted”, “assertion”, “set(s)”, “setting”, “deasserted”, “deassert”, “deasserting”, “deassertion” or the like terms may refer to data signals, which are either active low or active high signals. Therefore such terms, when associated with a signal, are interchangeably used to require either active high or active low signals.
Accordingly, once full flag 360 is asserted, indicating a buffer capacity condition, buffer capacity logic 230 will direct command controller 220 to throttle issuance of read requests until rest timer logic 240 has expired. In one embodiment, the value of rest timer logic 240 should be an interval long enough to drain buffer 280 from the full level down to a level X at which the load-to-drain visible latency and the time to drain the remaining data in the buffer are equal. Selecting a sufficient rest interval 380 yields continuous bursts of data on the drain side.
In one embodiment, buffer level X from restart to full determines the length of the next burst group. As described herein, a burst of data requests is issued to memory to provide constant read streaming of data to graphics engine 110. In the above example, the initial latency, in load clocks, is equal to four clocks. Thus, a value of five is chosen as the predetermined number of rest clock periods (in the load clock domain). During this period, read requests to memory are suppressed. In addition, the rest timer times an inactive load period to allow the drain side of the read buffer to reduce the buffer level.
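Putting the full flag and rest timer together, the throttling behavior may be sketched as follows (a hypothetical Python sketch; names such as RestThrottle are illustrative and not part of the embodiments):

    class RestThrottle:
        """Suppresses read requests for a fixed rest interval after a full flag."""

        def __init__(self, rest_clocks: int = 5):  # rest interval, in load clocks
            self.rest_clocks = rest_clocks
            self.remaining = 0

        def on_full_flag(self) -> None:
            # Buffer capacity condition detected: start the rest interval.
            self.remaining = self.rest_clocks

        def may_issue_request(self) -> bool:
            # Requests are suppressed while the rest timer runs; the drain
            # side uses this inactive load period to reduce the buffer level.
            if self.remaining > 0:
                self.remaining -= 1
                return False
            return True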
Representatively, the open loop allocation policy supports configurations where the load rate into the buffer is greater than or equal to the drain rate. However, calculation of the load-to-drain ratios, full constant settings and crossing clock penalties will vary according to the various load clock domains and drain clock domains of a system. Accordingly, the system configuration parameter values described herein are provided to illustrate one or more embodiments and should not be interpreted to limit or narrow the embodiments described herein. Although the above description is in the context of the load being memory and the drain being a graphics engine, other sources and drains for data may benefit from the embodiments described herein. Procedural methods for implementing one or more embodiments are described below.
Operation
Due to the difference in clock frequency between the load clock domain and the drain clock domain, as well as the load clock domain bandwidth, at process block 430, the rate of issuing data requests is regulated according to an approximate buffer capacity level to prevent buffer overflow. In other words, the effective load rate from a master bus agent may exceed the effective drain rate of data to a target bus agent. As a result, buffering of such data may cause buffer overflow, depending on the burst length of a data request. Hence, at process block 440, issuance of data requests to a master bus agent is throttled during detected buffer capacity conditions according to a predetermined buffer accumulation rate.
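The overall open loop policy may be exercised with a small simulation using the sample parameters above (illustrative only; this Python sketch models the policy's steady state behavior, not the claimed hardware):

    def simulate(load_clocks: int = 40) -> None:
        # Sample parameters from above, tracked in tenths of a hex word (HW)
        # to avoid floating point drift: each load clock commits 1.0 HW while
        # the drain side removes 0.8 HW in the same time (a 5:4 ratio). The
        # full constant is 4 HW and the rest interval is 5 load clocks.
        LOAD, DRAIN = 10, 8      # tenths of HW per load clock
        FULL, REST = 40, 5       # full constant (tenths of HW), rest clocks
        depth, resting = 0, 0

        for t in range(load_clocks):
            if resting:
                resting -= 1
                depth = max(0, depth - DRAIN)    # drain only; requests throttled
            else:
                depth += LOAD - DRAIN            # concurrent load and drain
                if depth >= FULL:                # buffer capacity condition
                    resting = REST               # start the rest interval
            print(f"clock {t:2d}: depth={depth / 10:.1f} HW "
                  f"{'rest' if resting else 'run'}")

    simulate()

Under these parameters the buffer level climbs to the full constant over twenty load clocks, the five rest clocks then drain exactly the accumulated four HW, and the cycle repeats, giving continuous bursts on the drain side.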
Open loop allocation, as described herein, may be used where die size is limited, which often prohibits the use of closed loop allocation schemes. Utilizing the open loop allocation scheme embodiments described herein, latency is reduced compared to closed loop allocation schemes while enabling, for example, a memory controller to sustain read streaming with a minimal size read buffer. Embodiments described herein facilitate maximum bandwidth usage for system configurations, and also avoid read buffer overflow for system configurations where master bus agent bandwidth exceeds the maximum bandwidth that can be supported by a target bus agent.
In any representation of the design, the data may be stored in any form of a machine readable medium. An optical or electrical wave 660 modulated or otherwise generated to transport such information, a memory 650, or a magnetic or optical storage 640, such as a disk, may be the machine readable medium. Any of these media may carry the design information. The term “carry” (e.g., a machine readable medium carrying information) thus covers information stored on a storage device or information encoded or modulated into or onto a carrier wave. The set of bits describing the design, or a particular part of the design, is (when embodied in a machine readable medium, such as a carrier or storage medium) an article that may be sold in and of itself, or used by others for further design or fabrication.
It will be appreciated that, for other embodiments, a different system configuration may be used. For example, while system 100 includes a single CPU 102, for other embodiments, a multiprocessor system (where one or more processors may be similar in configuration and operation to the CPU 102 described above) may benefit from the open loop allocation scheme of various embodiments. Further, a different type of system, or a different type of computer system, such as, for example, a server, a workstation, a desktop computer system, a gaming system, an embedded computer system, a blade server, etc., may be used for other embodiments.
Having disclosed exemplary embodiments and the best mode, modifications and variations may be made to the disclosed embodiments while remaining within the scope of the embodiments of the invention as defined by the following claims.