Embodiments of the invention relate to the array structure and port structure of a computer memory system that can handle two load operations concurrently.
A computer system may be divided into three basic blocks: a central processing unit (CPU), memory, and input/output (I/O) units. These blocks are coupled to each other by a bus. An input device, such as a keyboard, mouse, stylus, analog-to-digital converter, etc., is used to input instructions and data into the computer system via an I/O unit. These instructions and data can be stored in memory. The CPU receives the data stored in the memory and processes the data as directed by a set of instructions. The results can be stored back into memory or outputted via the I/O unit to an output device, such as a printer, a display unit (CRT or LCD), a digital-to-analog converter, etc.
The CPU receives data from memory as a result of performing load operations. Each load operation is typically initiated in response to a load instruction. The load instruction specifies the address of the location in memory at which the desired data is stored. The load instruction also specifies the amount of data that is desired. Using the specified address and amount of data, the memory may be accessed and the desired data obtained.
Data is stored back into memory as a result of the computer system performing a store operation. A store operation includes an address calculation and a data calculation. The address calculation generates the address of the memory location at which the data is going to be stored. The data calculation produces the data that is going to be stored at the address generated in the address calculation portion of the store operation. These two calculations are performed by different hardware in the computer system and require different resources. In the prior art, a processor upon receiving the store operation produces two micro-operations, referred to as the store data (STD) and the store address (STA) operations. These micro-operations correspond to the data calculation and address calculation sub-operations of the store operation respectively. The processor then executes the STD and STA operations separately. Upon completion of the execution of the STD and STA operations, their results are combined and ready for dispatch to a cache memory or a main memory.
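The decomposition of a store operation into its STA and STD micro-operations, followed by recombination of their results, can be sketched as follows (the class names, field names, and address values are illustrative assumptions, not part of the described embodiment):

```python
from dataclasses import dataclass

@dataclass
class MicroOp:
    kind: str      # "STA" (store address) or "STD" (store data)
    payload: int   # the computed address or the computed data

def split_store(base: int, offset: int, value: int):
    """Decompose a store into its address and data micro-operations."""
    sta = MicroOp("STA", base + offset)   # address calculation
    std = MicroOp("STD", value)           # data calculation
    return sta, std

def combine(sta: MicroOp, std: MicroOp):
    """Combine completed micro-ops for dispatch to cache or main memory."""
    return {"address": sta.payload, "data": std.payload}

sta, std = split_store(base=0x1000, offset=0x20, value=42)
print(combine(sta, std))  # {'address': 4128, 'data': 42}
```

In hardware the two micro-operations execute on different resources, as the passage above notes; the sketch only models the split and the final recombination.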
Some computer systems have the capability to execute instructions out-of-order. In other words, the CPU in a computer system is capable of executing one instruction before a previously issued instruction is completed. Special considerations exist with respect to performing memory operations out-of-order in a computer system. In the prior art, a store array and a load array are incorporated in a computer system as part of the solution to resolve data dependency conflicts that occur during out-of-order execution. A load array contains information associated with load operations; a store array contains information associated with store operations dispatched from the instruction fetch unit.
Memory access operations, for example the load and store operations described above, are among the biggest performance bottlenecks in a computer system. Slow memory access can penalize the performance of computer systems severely. Attempts to improve the computer system with various enhancement features may fail if performed without sufficient memory bandwidth to support them.
The present invention is illustrated by way of example and is not limited by the figures of the accompanying drawings, in which like references indicate similar elements, and in which:
Embodiments of a method and apparatus for a computer memory system are described. In the following description, numerous specific details are set forth. However, it is understood that embodiments may be practiced without these specific details. In other instances, well-known elements, specifications, and protocols have not been discussed in detail in order to avoid obscuring the present invention.
A memory execution unit is a part of an execution unit that is responsible for executing various memory access operations (e.g., load and store operations) in a processor. The memory execution unit receives load and store operations from a scheduler and executes them to complete the memory access operations. In one embodiment, a memory execution unit comprises a load array, a store array, a translation lookaside buffer, and a data cache. The components communicate with each other through ports. Each port may include control signals, data signals, and/or status signals. In one embodiment, dispatching an operation means sending any combination of the following: the address or addresses of the operands, status information of the operation, code associated with the operation, code indicating operands for the operation, etc. The implementations of different port structure designs can determine the memory bandwidth available between the scheduler and the data cache.
Using a new port structure design to increase memory bandwidth raises various physical design considerations (e.g., design area) as well as performance considerations. Balancing the two factors is important to ensure that the area of the design is kept within a manageable size while still enabling the design to enjoy the performance benefit of additional bandwidth for accessing a data cache.
The memory unit 110 is coupled to the system bus. The bus controller 101 is coupled to bus 111. The bus controller 101 is also coupled to data cache memory 106 and instruction fetch and issue unit 102. The instruction fetch and issue unit 102 is also coupled to execution core 104. The execution core 104 is also coupled to data cache memory 106. In this embodiment, instruction fetch and issue unit 102, execution core 104, bus controller 101, and data cache memory 106 together constitute parts of processing means 100. In this embodiment, elements 101-106 cooperate to fetch, issue, execute, and save the execution results of instructions in a pipelined manner.
The instruction fetch and issue unit 102 fetches instructions from an external memory, such as memory unit 110, through the bus controller 101 via bus 111, or any other external bus. The fetched instructions are stored in instruction cache 102. The bus controller 101 manages cache coherency transfers. The instruction fetch and issue unit 102 issues these instructions, in order, to execution core 104. The execution core 104 performs arithmetic and logic operations, such as add, subtract, logical AND, and integer multiply, as well as memory operations. In one embodiment, execution core 104 also includes memory execution unit 105, which holds, executes, and dispatches load and store operations to data cache memory 106 (as well as external memory) as soon as their operand dependencies on execution results of preceding instructions are resolved.
Bus controller 101, bus 111, and memory 110 are intended to represent a broad category of these elements found in most computer systems. Their functions and constitutions are well-known and will not be described further. The execution core 104, incorporating an embodiment of the present invention, and the data cache memory 106 are described in further detail below with additional references to the remaining figures.
In one embodiment, address generation unit X 201 is coupled to even entries array 211, arbiter 220, arbiter 222, and store array 213 via linear address port X 204. Address generation unit Y 202 is coupled to odd entries array 212, arbiter 221, arbiter 222, and store array 213 via linear address port Y 205. Data calculation unit 203 is coupled to store array 213 via port Z 206 to provide data corresponding to store operations.
In this embodiment, even entries array 211 is coupled to arbiter 220, and odd entries array 212 is coupled to arbiter 221. Store array 213 is coupled to arbiter 222. Arbiter 220, arbiter 221, and arbiter 222 are coupled to TLB 231 via load port X 223, load port Y 224, and STA port 225, respectively. In addition, store array 213 is also coupled to TLB 231 and data array 252 through store port 226.
In one embodiment, tag array 251 of data cache 250 is coupled to TLB 231 through three physical address ports (i.e., physical address port X 234, physical address port Y 235, and physical address port store 236). In one embodiment, data array 252 of data cache 250 can be coupled to a plurality of registers (e.g., 255, 256) to write the results of load operations using write back port X 254 and write back port Y 253. The physical address ports (e.g., 234, 235, and 236) are important for increasing the bandwidth available for accessing data cache 250.
In one embodiment, load array 210 and store array 213 are used to store in-flight load operations and store operations that have not been retired in the pipeline. In one embodiment, load array 210 and store array 213 are used in an out-of-order micro-architecture to resolve data dependency conflicts such as read-after-write (RAW) conflicts. Moreover, for the purpose of load consistency and memory reordering, the memory operations are maintained to a late point of the retirement stage in some embodiments to conform to the conventional X86 architecture. Scheduler 200 dispatches the operations into the memory system when all required data sources are ready.
In one embodiment, address generation unit X 201 and address generation unit Y 202 calculate linear addresses of load operations and store operations. Load operations and store operations can be dispatched using either address generation unit X 201 or address generation unit Y 202. The two ports (i.e. 204, 205) are shared to dispatch addresses for load operations and store operations. In one embodiment, the scheduler 200 uses a load balancing algorithm to attempt to have the two ports be used equally by all the memory operations (including load and store operations).
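The port-binding behavior of the scheduler can be sketched as follows; the exact load balancing algorithm is not specified by the description, so this fragment assumes an illustrative "pick the less-used port" policy (function and port names are also illustrative):

```python
from collections import Counter

def bind_ports(ops):
    """Bind each memory operation (load or store) to linear address
    port X or port Y, keeping the two ports' utilization balanced.
    Illustrative policy: always pick the less-used port; ties go to X."""
    use = Counter({"X": 0, "Y": 0})
    binding = {}
    for op in ops:
        port = "X" if use["X"] <= use["Y"] else "Y"
        use[port] += 1
        binding[op] = port
    return binding

ops = ["load0", "store0", "load1", "load2"]
print(bind_ports(ops))  # ports alternate: X, Y, X, Y
```

Note that both loads and stores share the two linear address ports, consistent with the description above; only the balancing policy here is assumed.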
In one embodiment, a load operation is allocated to an address generation unit (either address generation unit X 201 or address generation unit Y 202). In one embodiment, load array 210 is split into two arrays, namely even entries array 211 and odd entries array 212. Each array has a single write port. If a load operation is allocated to address generation unit X 201, the entry of the operation is dispatched through linear address port X 204. In one embodiment, a specific set of conditions (e.g., blocking status condition, address conflict information, and prioritization information) is used to determine whether a load operation is allowed to continue in execution. If the load operation is blocked from immediate execution, it is stored in even entries array 211.
On the other hand, if a load operation entry is allocated to address generation unit Y 202, the entry of the operation is dispatched through linear address port Y 205. If the load operation is blocked based on the conditions described above, the entry of the operation is stored in odd entries array 212.
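The routing of blocked loads into the split load array can be sketched as follows (the function name, the boolean blocking flag, and the return strings are illustrative; the actual blocking conditions involve blocking status, address conflicts, and prioritization as described above):

```python
even_entries, odd_entries = [], []  # two halves of the load array, one write port each

def dispatch_load(op_id: int, agu: str, blocked: bool) -> str:
    """Route a load through its AGU's linear address port; if the load
    is blocked from immediate execution, park it in the matching half
    of the load array (AGU X -> even entries, AGU Y -> odd entries)."""
    if not blocked:
        return "executed"
    (even_entries if agu == "X" else odd_entries).append(op_id)
    return "stored"

print(dispatch_load(1, "X", blocked=True))   # stored
print(even_entries)                          # [1]
```

Splitting the array in two, as the passage describes, lets each half make do with a single write port while still accepting one blocked load per port per cycle.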
In one embodiment, scheduler 200 binds store operations to either of the ports (i.e., 204, 205) based on a load balancing algorithm. Addresses for store operations are dispatched via linear address port 204 and linear address port 205. Store operations, if blocked, are stored in store array 213. Addresses for store operations are dispatched to linear address port X 204 or linear address port Y 205 regardless of their location in the store array 213. In one embodiment, store array 213 is dual ported and two addresses can be written thereto from address generation unit X 201 and address generation unit Y 202 during a clock cycle.
In one embodiment, arbiter 222 selects store addresses from linear address port 204, linear address port 205, and store array 213 to send the addresses for store operations to TLB 231 via STA port 225. Physical addresses for store operations are subsequently dispatched from TLB 231 to data cache 250 using a dedicated port: physical address (PA) store port 236. In one embodiment, store array 213 has a dedicated port 206 to receive store data from data calculation unit 203. Data for store operations is sent to TLB 231 and to data cache 250 via store port 226.
Load operations are dispatched from load array 210 to TLB 231 with two dedicated ports (i.e., load port X 223 and load port Y 224). All load operations dispatched from address generation unit X 201 or stored in even entries array 211 are dispatched on load port X 223. Arbiter 220 selects one load operation at a time from even entries array 211 and linear address port X 204 of scheduler 200. All load operations in odd entries array 212 are dispatched on load port Y 224. Arbiter 221 selects one load operation at a time from odd entries array 212 and linear address port Y 205 of scheduler 200. The load array 210 therefore has two read ports, one for each half of load array 210.
TLB 231 includes three ports (load port X 223, load port Y 224, and STA port 225) to receive addresses from the arbiters (220, 221, and 222). Each of the ports is a non-shared port (not shared between store and load operations), and each port is connected to specific hardware implementations. In one embodiment, TLB 231 translates a linear address into a physical address in a manner well-known in the art. A linear address comprises two parts: a page reference and an offset. A physical address also comprises two parts: a page address and an offset. The generated physical addresses are sent to data cache 250 via physical address port X 234, physical address port Y 235, and physical address store port 236.
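The linear-to-physical translation described above can be sketched as follows (the 4 KiB page size, the page-table contents, and the function name are illustrative assumptions, and a TLB hit is assumed):

```python
PAGE_SHIFT = 12                    # 4 KiB pages (illustrative)
PAGE_MASK = (1 << PAGE_SHIFT) - 1

# toy translation table: page reference -> physical page address
page_table = {0x12345: 0x00ABC}

def translate(linear_addr: int) -> int:
    """Split a linear address into page reference and offset, look up
    the physical page address, and rejoin it with the unchanged offset."""
    page_ref = linear_addr >> PAGE_SHIFT
    offset = linear_addr & PAGE_MASK
    phys_page = page_table[page_ref]        # TLB hit assumed
    return (phys_page << PAGE_SHIFT) | offset

print(hex(translate(0x12345678)))  # 0xabc678
```

The offset passes through unchanged; only the page reference is replaced by the page address, which is the behavior the two-part address decomposition above implies.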
In one embodiment, data cache 250 can handle two load operations and one store operation every clock cycle. Tag array 251 and data array 252 are triple ported. Tag array 251 contains the address and state of each line stored in data array 252. To serve two load operations and one store operation every clock cycle, tag array 251 has three physical ports. The ports are non-shared ports. Data array 252 contains the data portions of copies of lines of main memory. The structure of data array 252 is described in further detail below with additional references to the remaining figures. In one embodiment, register 255 and register 256 are coupled to receive results from data array 252 via write back port X 254 and write back port Y 253. In one embodiment, write back port X sends the results of load operations dispatched through address generation unit X 201, while write back port Y sends the results of load operations dispatched through address generation unit Y 202.
To handle two load operations and one store operation in one clock cycle, the data array implements a bank conflict check (not shown in the figure) between the two load operations, in which the two load operations can complete only if they access different memory banks. In one embodiment, load operations that cannot be completed because of a memory bank conflict are re-dispatched or replayed. Two addresses are sent to each memory bank using either port X 310 or port Y 311. A multiplexer (e.g., 320) in each memory bank selects one of the addresses. This address is decoded, and the data is read from the location referenced by this address in all the ways in the memory bank. In one embodiment, each memory bank comprises eight ways (not shown in the figure). In other embodiments, the memory banks can comprise a different number of ways. Way-select multiplexers (e.g., 321) select one of the ways and subsequently drive the resultant data from a load operation to the write back bus 312. The write back bus 312 is coupled to write back port X 254 and write back port Y 253.
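The bank conflict check between the two concurrent loads can be sketched as follows (the bank count, the bank width, and the policy of replaying the second load on a conflict are illustrative assumptions; the description only requires that conflicting loads be re-dispatched):

```python
NUM_BANKS = 8      # illustrative bank count
BANK_SHIFT = 4     # illustrative 16-byte bank width

def bank_of(addr: int) -> int:
    """Map an address to its memory bank (illustrative interleaving)."""
    return (addr >> BANK_SHIFT) % NUM_BANKS

def check_bank_conflict(load_x_addr: int, load_y_addr: int):
    """Two loads complete in the same cycle only if they hit different
    banks; on a conflict, one of them is replayed (re-dispatched).
    Illustratively, port Y's load is the one replayed."""
    if bank_of(load_x_addr) == bank_of(load_y_addr):
        return ("complete", "replay")
    return ("complete", "complete")

print(check_bank_conflict(0x100, 0x110))  # different banks: both complete
print(check_bank_conflict(0x100, 0x900))  # same bank: one load replays
```

With single-ported banks, the per-bank multiplexer described above can only select one of the two addresses per cycle, which is why a same-bank pair forces a replay.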
The processor 705 may have any number of processing cores. Other embodiments of the invention, however, may be implemented within other devices within the system or distributed throughout the system in hardware, software, or some combination thereof.
The main memory 710 may be implemented in various memory sources, such as dynamic random-access memory (DRAM), a hard disk drive (HDD) 720, or a memory source located remotely from the computer system via network interface 730 or via wireless interface 740 containing various storage devices and technologies. The cache memory may be located either within the processor or in close proximity to the processor, such as on the processor's local bus 707. Furthermore, the cache memory may contain relatively fast memory cells, such as a six-transistor (6T) cell, or other memory cell of approximately equal or faster access speed.
Other embodiments of the invention, however, may exist in other circuits, logic units, or devices within the system of
Similarly, at least one embodiment may be implemented within a point-to-point computer system.
The system of
Other embodiments of the invention, however, may exist in other circuits, logic units, or devices within the system of
Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that any particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, references to details of various embodiments are not intended to limit the scope of the claims which in themselves recite only those features regarded as essential to the invention.