The following detailed description makes reference to the accompanying drawings, which are now briefly described.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
Turning now to
The processor 12A is shown in greater detail in
The core 22 generally includes the circuitry that implements instruction processing in the processor 12A, according to the instruction set architecture implemented by the processor 12A. That is, the core 22 may include the circuitry that fetches, decodes, executes, and writes results of the instructions in the instruction set. The core 22 may include one or more caches. In one embodiment, the processors 12A-12B implement the PowerPC™ instruction set architecture. However, other embodiments may implement any instruction set architecture (e.g. MIPS™, SPARC™, x86 (also known as Intel Architecture-32, or IA-32), IA-64, ARM™, etc.). In the illustrated embodiment, the core 22 includes a load/store (L/S) unit 30 including a load/store queue (LSQ) 32.
The interface unit 24 includes the circuitry for interfacing between the core 22 and other components coupled to the interconnect 20, such as the processor 12B, the L2 cache 14, the I/O bridge 16, and the memory controller 18. In the illustrated embodiment, cache coherent communication is supported on the interconnect 20 via the address, response, and data phases of transactions on the interconnect 20. Generally, a transaction is initiated by transmitting the address of the transaction in an address phase, along with a command indicating which transaction is being initiated and various other control information. Cache coherent agents on the interconnect 20 use the response phase to maintain cache coherency. Each coherent agent responds with an indication of the state of the cache block addressed by the address, and may also retry transactions for which a coherent response cannot be determined. Retried transactions are cancelled, and may be reattempted later by the initiating agent. The order of successful (non-retried) address phases on the interconnect 20 may establish the order of transactions for coherency purposes. The data for a transaction is transmitted in the data phase. Some transactions may not include a data phase. For example, some transactions may be used solely to establish a change in the coherency state of a cached block. Generally, the coherency state for a cache block may define the permissible operations that the caching agent may perform on the cache block (e.g. reads, writes, etc.). Common coherency state schemes include the modified, exclusive, shared, invalid (MESI) scheme, the MOESI scheme which includes an owned state in addition to the MESI states, and variations on these schemes.
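The permissible-operation aspect of the coherency states may be sketched as follows. The state names follow the MESI scheme described above; the function names and the Python rendering are illustrative assumptions, not part of any described embodiment.

```python
from enum import Enum

class CoherencyState(Enum):
    MODIFIED = "M"   # dirty, exclusive to this cache
    EXCLUSIVE = "E"  # clean, no other cache holds the block
    SHARED = "S"     # clean, possibly held by other caches
    INVALID = "I"    # no valid copy in this cache

def may_read(state):
    # Any valid copy of the block may be read by the caching agent.
    return state in (CoherencyState.MODIFIED, CoherencyState.EXCLUSIVE,
                     CoherencyState.SHARED)

def may_write(state):
    # A write in the E state silently transitions to M; writing in S or I
    # first requires a coherent transaction on the interconnect.
    return state in (CoherencyState.MODIFIED, CoherencyState.EXCLUSIVE)
```

The MOESI scheme would add an owned (O) state to this enumeration, permitting a dirty block to be shared.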
The interconnect 20 may have any structure. For example, the interconnect 20 may have separate address, response, and data interfaces to permit split transactions on the interconnect 20. The interconnect 20 may support separate address and data arbitration among the agents, permitting data phases of transactions to occur out of order with respect to the corresponding address phases. Other embodiments may have in-order data phases with respect to the corresponding address phase. In one implementation, the address phase may comprise an address packet that includes the address, command, and other control information. The address packet may be transmitted in one bus clock cycle, in one embodiment. In one particular implementation, the address interconnect may include a centralized arbiter/address switch to which each source agent (e.g. processors 12A-12B, L2 cache 14, and I/O bridge 16) may transmit address requests. The arbiter/address switch may arbitrate among the requests and drive the request from the arbitration winner onto the address interconnect. In one implementation, the data interconnect may comprise a limited crossbar in which data bus segments are selectively coupled to drive the data from data source to data sink.
The core 22 may generate various requests. Generally, a core request may comprise any communication request generated by the core 22 for transmission as a transaction on the interconnect 20. Core requests may be generated, e.g., for load/store instructions that miss in the data cache (to retrieve the missing cache block from memory), for fetch requests that miss in the instruction cache (to retrieve the missing cache block from memory), uncacheable load/store requests, writebacks of cache blocks that have been evicted from the data cache, etc. The interface unit 24 may receive the request address and other request information from the core 22, and corresponding request data for write requests (Data Out). For read requests, the interface unit 24 may supply the data (Data In) in response to receiving the data from the interconnect 20.
Generally, a buffer such as the memory request buffer 26 may comprise any memory structure that is logically viewed as a plurality of entries. In the case of the memory request buffer 26, each entry may store the information for one transaction to be performed on the interconnect 20. In some cases, the memory structure may comprise multiple memory arrays. For example, the memory request buffer 26 may include an address buffer configured to store addresses of requests and a separate data buffer configured to store data corresponding to the request, in some embodiments. An entry in the address buffer and an entry in the data buffer may logically comprise an entry in the memory request buffer 26, even though the address and data buffers may be physically read and written separately, at different times.
In one embodiment, the memory request buffer 26 may be used as a load merge buffer for uncacheable load requests. A first uncacheable load request may be written to the memory request buffer 26, having a first address to which the load request is directed. Additional uncacheable load requests, if they have an address matching the first address within a defined granularity, may be merged with the first uncacheable load request. For example, the granularity may be larger than the size of the uncacheable load requests (e.g. two merged uncacheable load requests may each access one or more bytes not accessed by the other request). Generally, merging uncacheable load requests may include performing the same, single transaction on the interconnect 20 to concurrently satisfy each of the merged requests. That is, a single transaction is performed on the interconnect 20 and data returned from the single transaction is forwarded as the load result in the core 22 for each of the merged requests. If uncacheable load requests cannot be merged, separate transactions may be used for each respective uncacheable load request. In one embodiment, merging the uncacheable load requests may be implemented by updating the entry in the memory request buffer 26 that stores the first uncacheable load request to ensure the data for the merged uncacheable load request is also read in the transaction. To the extent that uncacheable load requests are successfully merged, bandwidth consumed on the interconnect 20 by the processor 12A may be reduced, in some embodiments. Performance may be increased due to the freed bandwidth and/or power consumption may be reduced, in various embodiments.
A load memory operation (or more briefly, “a load”) may be generated by the core 22 responsive to an explicit load instruction, or responsive to an implicit load specified by any instruction. Loads may be cacheable (i.e. caching of the load data is permitted) or uncacheable (caching of the load data is not permitted). Loads may be specified as cacheable or uncacheable in any desired fashion, according to the instruction set architecture implemented by the processor 12A. For example, in some embodiments, cacheability (or uncacheability) is an attribute specified in the virtual to physical translation data structures used to translate the load address from virtual to physical. In some embodiments, instructions may encode cacheability/uncacheability directly. Combinations of such techniques may also be used.
Addresses may match within the defined granularity if the addresses both refer to data within a contiguous block of aligned memory of the size of the granularity. More particularly, least significant address bits that define offsets within the aligned memory may be ignored when comparing the addresses for a match. For example, if the granularity is 16 bytes, the least significant 4 bits of addresses may be ignored when comparing for address match.
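The offset-masking comparison described above may be sketched as follows, using the 16-byte granularity of the example; the function name and the Python rendering are illustrative assumptions.

```python
GRANULARITY = 16  # bytes; assumed to be a power of two

def addresses_match(addr_a, addr_b, granularity=GRANULARITY):
    """Return True if both addresses refer to data within the same
    contiguous, aligned block of `granularity` bytes. The least
    significant bits that define the offset within the block are
    masked off before comparing."""
    mask = ~(granularity - 1)
    return (addr_a & mask) == (addr_b & mask)
```

With a 16-byte granularity, addresses 0x1000 and 0x100F match (same aligned block), while 0x1000 and 0x1010 do not.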
In various embodiments, the granularity may be fixed or programmable. The granularity may be defined based on a variety of factors. For example, the granularity may be based on the capabilities of the devices that are targeted by uncacheable requests, in some embodiments. Alternatively, the granularity may be defined to be the width of a single data transfer (or “beat”) on the interconnect 20 (or a multiple of the width of a data transfer). The granularity may be defined to be the size of a cache block in the caches of the processor 12A, in other embodiments.
The uncacheable loads may be stored in the LSQ 32 in the load/store unit 30. Based on various implementation-dependent criteria, each load may be selected for processing. The load/store unit 30 may generate an uncacheable load request to the interface unit 24, which may merge the uncacheable load request with a previously recorded uncacheable load request or allocate a new buffer entry in the buffer 26 for the uncacheable load request.
In one implementation, the memory request buffer 26 may be a unified buffer comprising entries that may be used to store addresses of core requests and addresses of snoop requests, as well as corresponding data for the requests. In one embodiment, the memory request buffer 26 may be used as a store merge buffer. Cacheable stores (whether a cache hit or a cache miss) may be written to the memory request buffer 26. Additional cacheable stores to the same cache block may be merged into the memory request buffer entry. Subsequently, the modified cache block may be written to the data cache. Uncacheable stores may also be merged in the memory request buffer 26.
The L2 cache 14 may be an external level 2 cache, where the data and instruction caches in the core 22, if provided, are level 1 (L1) caches. In one implementation, the L2 cache 14 may be a victim cache for cache blocks evicted from the L1 caches. The L2 cache 14 may have any construction (e.g. direct mapped, set associative, etc.).
The I/O bridge 16 may be a bridge to various I/O devices or interfaces (not shown in
The memory controller 18 may be configured to manage a main memory system (not shown in
Turning now to
In one embodiment, the control unit 40 may include a set of queues (not shown in
An exemplary entry 44 is shown in the memory request buffer 26. Other entries may be similar. The entry 44 includes the address of the request and control/status information. The control/status information may include the command for the address phase, a transaction identifier (ID) that identifies the transaction on the interconnect 20, and various other status bits that may be updated as the transaction corresponding to a request is processed toward completion. The entry 44 may further include data (e.g. a cache block in size, in one embodiment) and a set of byte enables (BE). There may be a BE bit for each byte in the cache block. In one embodiment, a cache block may be 64 bytes and thus there may be 64 BE bits. Other embodiments may implement cache blocks larger or smaller than 64 bytes (e.g. 32 bytes, 16 bytes, 128 bytes, etc.) and a corresponding number of BE bits may be provided. The BE bits may be used for load merging, in some embodiments, and may also record which bytes are valid in the entry 44. For example, in one embodiment, a cache block of data may be transferred over multiple clock cycles on the interconnect 20. For example, 16 bytes of data may be transferred per clock cycle for a total of 4 clock cycles of data transfer on the interconnect 20 for a 64 byte block. Similarly, in some embodiments, multiple clock cycles of data transfer may occur on the Data Out/Data In interface to the core 22. For example, 16 bytes may be transferred per clock between the core 22 and the interface unit 24. The BE bits may record which bytes have been provided in each data transfer.
If the granularity for load merging is smaller than a cache block, only a portion of the BE bits may be used for a given uncacheable load request entry. The number of BE bits used may be based on the size of the granularity.
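The byte-enable handling described above may be sketched as follows, assuming a 64-byte cache block with one BE bit per byte; the function names and the bitmask representation are illustrative assumptions.

```python
CACHE_BLOCK_SIZE = 64  # bytes, so 64 byte-enable (BE) bits per entry

def be_bits_for_load(addr, size, block_size=CACHE_BLOCK_SIZE):
    """Return a bitmask with one BE bit set for each byte the load reads:
    the byte addressed by the load and `size - 1` contiguous bytes,
    indexed by the load's offset within the aligned block."""
    offset = addr % block_size
    return ((1 << size) - 1) << offset

def merge_be(entry_be, addr, size):
    """Merge an additional load's requested bytes into an existing
    buffer entry's BE bits (block 64 of the flow described below)."""
    return entry_be | be_bits_for_load(addr, size)
```

For example, a 4-byte load at offset 4 within the block sets BE bits 4 through 7; merging a 2-byte load at offset 0 into that entry additionally sets bits 0 and 1.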
An exemplary entry 46 in the LSQ 32 is also shown in
The load/store unit 30 receives core load/store memory operations from the rest of the core 22. The memory operations may include the address of the memory operation (that is, the address to be read for a load or written for a store), the type information including load or store and cacheable or uncacheable, the register address for loads, the size of the operation, etc. The core 22 may use the core control interface to indicate that a memory operation is being provided. The control unit 42 may allocate an entry in the LSQ 32 to store the memory operation.
The remainder of this discussion will focus on the uncacheable load memory operation, and illustrate the uncacheable load merging. Generally, an uncacheable load may be selected by the control unit 42 for transmission to the interface unit 24 according to any set of criteria. For example, the uncacheable load may be nonspeculatively selected (e.g. after each prior memory operation in the LSQ 32 has been retired or at least is nonspeculative), selected in order but speculatively (e.g. selected after each prior memory operation in the LSQ 32 but without regard to being nonspeculative), speculatively selected ahead of other loads, speculatively selected without restriction, etc.
When the uncacheable load has been selected, the control unit 42 may provide the entry number of the uncacheable load in the LSQ 32 to the LSQ 32 to read the information used to generate the uncacheable load request to the interface unit 24. The request may include the address, type, and size of the load, for example.
The memory request buffer 26 may be configured to compare the request address to the addresses in the buffer entries in response to receiving the request. For example, the memory request buffer 26 may comprise a content addressable memory (CAM), at least for the address portion of the entry. For uncacheable loads, the comparison may be made according to the defined granularity mentioned above, and the comparison result may be used to detect a potential load merge. If a CAM match is detected and a load merge is not possible, the control unit 40 may use a replay control signal (part of the Other Ctl in
If a request is not replayed or merged, the request is written to a buffer entry in the memory request buffer 26 allocated by the control unit 40. If the request is not replayed, the control unit 40 may transmit the buffer ID of the buffer entry to which the request is written to the LSQ 32. The LSQ 32 may write the buffer ID to the entry corresponding to the request. Subsequently, the request may be selected by the control unit 40 to initiate its transaction on the interconnect 20. For uncacheable load transactions, a subsequent data phase returns the data from the target of the transaction. In one embodiment, the data provided from the interconnect 20 may also include a transaction ID (ID in
The data for the uncacheable load transaction may be forwarded from the memory request buffer 26 to the core 22 (e.g. to be written to a register file). The control unit 40 may also provide the buffer ID of the buffer entry from which data is being forwarded, and the LSQ 32 may compare the buffer ID to the buffer ID fields in its entries. The control unit 42 may select the oldest uncacheable load in the LSQ 32 which matches the buffer ID, and may read the RegAddr field of the entry to supply the register address for forwarding. The oldest uncacheable load may also be deleted from the LSQ 32 in response to the forwarding. The oldest uncacheable load may be the load that is prior, in program order, to other uncacheable loads to which it is being compared.
The control unit 42 may also provide an indication of the number of uncacheable loads that matched the buffer ID. The control unit 40 may repeat the forwarding a number of times equal to the number of loads, to forward data for each merged load.
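The repeated forwarding described above may be sketched as follows. The dictionary-based LSQ entries, the field names, and the return value are illustrative assumptions, not the described hardware structures.

```python
def forward_merged_loads(lsq_entries, buffer_id, data):
    """Repeatedly forward `data` to the oldest LSQ entry whose stored
    buffer ID matches `buffer_id`, deleting each entry as it is
    forwarded, until no matching entries remain. The number of
    iterations equals the number of merged loads. Returns the
    (register address, data) pairs in forwarding order."""
    forwards = []
    while True:
        matches = [e for e in lsq_entries if e["buffer_id"] == buffer_id]
        if not matches:
            break
        # The oldest load is the one prior, in program order, to the others.
        oldest = min(matches, key=lambda e: e["age"])
        forwards.append((oldest["reg_addr"], data))
        lsq_entries.remove(oldest)
    return forwards
```

Entries with a different buffer ID are untouched, so merged loads belonging to other buffer entries continue to await their own data phases.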
It is noted that, while byte enables are used in the present embodiment to indicate which bytes are requested (e.g. for merged uncacheable loads), any indication of the data bytes being requested may be transmitted as part of the transaction for an uncacheable load request (or merged uncacheable load requests). For example, if merging were limited to requests that access a byte or bytes contiguous to bytes that were already requested, a byte count may be transmitted. In other embodiments, a given enable bit may correspond to more than one byte, if one byte granularity of data transfers is not supported on the interconnect 20.
The buffer 26 and LSQ 32 may comprise any type of memory. For example, the buffer 26 and LSQ 32 may comprise one or more random access memory (RAM) arrays, clocked storage devices such as flops, registers, latches, etc., or any combination thereof. In one embodiment, at least the portion of the buffer 26 that stores address bits and the portion of the LSQ 32 that stores the buffer ID may be implemented as a content addressable memory (CAM) for comparing addresses and buffer IDs as mentioned above.
It is noted that, while the LSQ 32 is shown in the illustrated embodiment, other embodiments may implement separate queues for loads and for stores.
The load address is compared to the addresses in the memory request buffer 26. If no match is detected at the granularity used for uncacheable loads (decision block 50, “no” leg), the control unit 40 may check if a memory request buffer (MRB) entry is available to store the uncacheable load request. If no entry is available (decision block 52, “no” leg), the control unit 40 may assert replay for the load request (block 54). If an entry is available (decision block 52, “yes” leg), the control unit 40 may allocate a buffer entry, and may write the load request into the allocated buffer entry (block 56). The byte enables in the buffer entry may also be initialized by setting the BE bits for bytes requested by the load and clearing other BE bits. Other control information may also be written to the allocated buffer entry. The bytes requested by the load comprise the byte addressed by the load address and a number of contiguous bytes based on the size of the load request (e.g. 1, 2, 4, or 8 bytes in one embodiment). Additionally, the control unit 42 may provide the buffer ID of the allocated buffer entry to the LSQ 32, which may write the buffer ID to the entry storing the load memory operation corresponding to the load request (block 58).
If there is a match of the load address in the buffer 26 within the granularity for loads (decision block 50, “yes” leg), the control unit 40 may determine if the entry that is matched is also an uncacheable load request. If the request is not a load, or is a cacheable load, then a load merge is not permitted in this embodiment. If the entry that is matched is not an uncacheable load request (decision block 60, “no” leg), the control unit 40 may assert replay for the load request (block 54). If the match is on a buffer entry that is storing an uncacheable load request (decision block 60, “yes” leg), the control unit 40 may determine if the merge of the load request into the buffer entry is permitted (decision block 62). There may be a variety of reasons why a load request is not permitted to be merged into the buffer entry (referred to as a “merge buffer” for brevity). For example, the merge buffer may be “closed” because the transaction for the request has been initiated on the interconnect 20. Additional details regarding the closing of a merge buffer are provided below. Additionally, in some embodiments, a load request that reads a byte that is also read by a previously merged load request may not be permitted. For example, if an uncacheable load results in a change of state to the targeted location (e.g. a clear-on-read register), such a merge may not be permitted. Other embodiments may permit merging a load request that reads a byte that is also read by a previously merged load request. If the merge is not permitted (decision block 62, “no” leg), the control unit 40 may assert replay for the load request (block 54). If the merge is permitted (decision block 62, “yes” leg), the buffer 26 may update the BE bits in the merge buffer (block 64). That is, the BE bits for bytes read by the load request may be set (if not already set).
The control unit 40 may provide the buffer ID of the merge buffer to the LSQ 32 to be stored in the LSQ entry corresponding to the load request (block 66, similar to block 58).
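The flow of decision blocks 50 through 66 may be sketched as follows. The data structures, the `capacity` parameter, and the returned outcome strings are illustrative assumptions; overlapping-byte restrictions and other control information are omitted for brevity.

```python
def handle_uncacheable_load(mrb, lsq_entry, addr, size,
                            granularity=16, capacity=8):
    """Sketch of decision blocks 50-66: merge the load into a matching
    open uncacheable-load entry, else allocate a new MRB entry, else
    assert replay."""
    mask = ~(granularity - 1)
    be = ((1 << size) - 1) << (addr % granularity)
    for entry in mrb:
        if (entry["addr"] & mask) == (addr & mask):
            # Block 60/62: merge only into an open uncacheable load entry.
            if entry["kind"] == "uc_load" and not entry["closed"]:
                entry["be"] |= be                       # block 64
                lsq_entry["buffer_id"] = entry["id"]    # block 66
                return "merged"
            return "replay"                             # block 54
    if len(mrb) >= capacity:
        return "replay"                                 # block 52 -> 54
    new_entry = {"id": len(mrb), "addr": addr, "kind": "uc_load",
                 "closed": False, "be": be}             # block 56
    mrb.append(new_entry)
    lsq_entry["buffer_id"] = new_entry["id"]            # block 58
    return "allocated"
```

A second load to the same aligned block merges by ORing its byte enables into the existing entry; once the entry is closed, or if the matching entry is not an uncacheable load, the later load is replayed.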
Merging of additional uncacheable load requests may be performed similar to
The memory request buffer 26 may receive the transaction ID transmitted in the data phase on the interconnect 20 and may use the portion of the transaction ID that identifies the buffer entry as a write index to the buffer 26. The buffer 26 may write the data into the data field of the identified buffer entry (block 70). The control unit 40 may wait for the core to be ready for a forwarding of the data (decision block 72). For example, in one embodiment, a hole in the load/store pipeline that accesses the data cache may be required to forward data. When such a hole is provided, the forwarding may be scheduled. The control unit 40 may read the entry in the buffer 26, and the buffer 26 may transfer data from the buffer entry to the core 22 over the Data In interface (block 74). Additionally, the control unit 40 may also transmit the buffer ID to the LSQ 32. The LSQ 32 may compare the buffer ID to the stored buffer IDs, and the control unit 42 may cause the oldest load that matches the buffer ID to be forwarded. The register address from the oldest load may be output by the LSQ 32 to the forwarding hardware in the core 22. Additionally, in some embodiments, byte selection controls may be forwarded to identify which bytes from the buffer 26 are to be forwarded to the register destination (e.g. based on the address of the load being forwarded and the size of the load). The control unit 42 may delete the load request from the LSQ 32. Additionally, the load/store unit 30 may signal the number of loads that matched the buffer ID (including the one that was selected for forwarding) (block 76).
If the number of loads indicated by the control unit 42 is 1 (i.e. the oldest load is the only load), then the forwarding is complete and the control unit 40 may delete the request from the memory request buffer 26 (decision block 78, “yes” leg and block 80). On the other hand, if the number of loads indicated by the control unit is not 1 (decision block 78, “no” leg), the control unit 40 may attempt to schedule another forwarding of the data. The data may thus be forwarded a number of times equal to the number of loads that were merged into the entry. In some embodiments, the control unit 40 may determine the number of loads after each forwarding attempt. In other embodiments, the number of loads may be determined at the first forwarding, and the control unit 42 may also record a list of the entry numbers in the LSQ 32 to be forwarded to. As the forwards are scheduled by the control unit 40, the control unit 42 may forward each entry according to the relative ages of the entries. Each forward may occur during a different clock cycle, or two or more forwards may be performed in parallel, in some embodiments, if forwarding hardware is provided to perform the forwards in parallel. Alternatively, the youngest entry in the LSQ 32 that matches the buffer ID may be marked, and the control unit 40 may continue scheduling the forwarding of data until the marked entry is forwarded. Then, the forwarding for that merged load is complete and the buffer entry may be invalidated.
As mentioned previously, the transmission of the byte enables may be delayed from the initiation of the transaction for a set of merged load requests, to permit additional merging, in some embodiments.
At a first point in time, the transaction to read the bytes accessed by the merged loads is transmitted (block 92). The address of the transaction is transmitted, but the byte enables (or other indication of the requested bytes) are delayed until a later point in time (block 94). Byte enable transmission may be delayed in a variety of ways. For example, an additional command may be transmitted on the interconnect 20 to transmit the byte enables (block 94), after the command to initiate the transaction (block 92). Alternatively, the transaction may be defined to transmit the byte enables at a later time (e.g. in response to a signal from the target of the request, or at a predetermined delay from the initiation of the transaction). Sideband signals may also be used to transmit the byte enables, rather than transmitting them on the interconnect 20. Subsequent to transmitting the byte enables, the data is returned (block 96).
Additional load merging may be permitted up until the byte enables are transmitted, even though the transaction to read the bytes has been initiated (arrow 98). Subsequent to transmission of the byte enables, load merging may not be permitted (arrow 100). Optionally, load merging may be permitted if the byte enables that would be set by a load are already set in the byte enables that were transmitted for the transaction.
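The merge window bounded by arrows 98 and 100 may be sketched as follows, including the optional subset rule; the class and method names are illustrative assumptions.

```python
class MergeWindow:
    """Sketch of the merge window: merging is permitted after the
    address phase is initiated (arrow 98) and is closed once the byte
    enables have been transmitted (arrow 100)."""

    def __init__(self):
        self.be = 0           # accumulated byte enables for the transaction
        self.be_sent = False  # set once the byte enables are transmitted

    def try_merge(self, load_be):
        """Attempt to merge a load with byte enables `load_be`.
        Returns True if the merge is permitted."""
        if self.be_sent:
            # Optional rule: permit the merge only if the load's bytes
            # are already covered by the transmitted byte enables.
            return (load_be & self.be) == load_be
        self.be |= load_be
        return True

    def send_byte_enables(self):
        """Transmit the accumulated byte enables, closing the window."""
        self.be_sent = True
        return self.be
```

Before `send_byte_enables`, any load to the same block merges and widens the accumulated enables; afterwards, only loads whose bytes are a subset of the transmitted enables may optionally be merged.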
Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.