1. Field of the Invention
This invention relates generally to the field of computer processors. More particularly, the invention relates to an apparatus and method for implementing a memory-hierarchy aware producer-consumer instruction for transferring data between cores in a processor.
2. Description of the Related Art
Referring to
The foregoing approach suffers from high latency and low bandwidth because the snoop protocol required to perform the data transfer operation is not performance-optimized as are standard read/write processor operations. An additional drawback of existing approaches is the pollution of the cache of the producer core with data it will never consume, thereby evicting data it might need in the future.
As such, a more efficient mechanism is needed for exchanging data between the cores of a CPU.
A better understanding of the present invention can be obtained from the following detailed description in conjunction with the following drawings, in which:
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention described below. It will be apparent, however, to one skilled in the art that the embodiments of the invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form to avoid obscuring the underlying principles of the embodiments of the invention.
In one embodiment, when transferring data from a producer core to a consumer core within a central processing unit (CPU), the producer core will not store the data in its own L1 cache as in prior implementations. Rather, the producer core will execute an instruction to cause the data to be stored in the highest cache level common to both of the CPU cores. For example, if both the producer core and the consumer core have read/write access to the level 3 (L3) cache (also sometimes referred to as the lower level cache) then the L3 cache is used to exchange the data. Note, however, that the underlying principles of the invention are not limited to the use of any particular cache level for exchanging data.
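The choice of exchange cache described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the topology table, the cache names, and the function `common_exchange_cache` are hypothetical, and "highest cache level common to both" is interpreted here as the shared cache closest to the cores.

```python
# Hypothetical topology: each core lists the caches it has read/write access
# to, ordered from closest to the core (L1) to farthest (L3).
CORE_CACHES = {
    "core0": ["L1-0", "L2-0", "L3-shared"],
    "core1": ["L1-1", "L2-1", "L3-shared"],
}


def common_exchange_cache(producer, consumer, topology=CORE_CACHES):
    """Walk outward from the producer and return the first cache the consumer shares.

    Assumption: "highest common level" means the shared cache nearest the cores.
    """
    consumer_caches = set(topology[consumer])
    for cache in topology[producer]:
        if cache in consumer_caches:
            return cache
    return None  # no shared cache exists in this toy topology
```

With private L1 and L2 caches and a shared L3, as in the example in the text, the function selects the L3 cache.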
As illustrated in
In operation, core-core producer-consumer logic 211a of the producer core 201 (Core 0 in the example) initially writes the data to be exchanged to fill buffers 251 within the CPU 250. Caches (such as the L1, L2, and L3 caches 212, 213, and 214, respectively) work in cache lines which are a fixed size (64 bytes in one particular embodiment) whereas typical store operations can vary from 1 byte to 64 bytes in size. In one embodiment, the fill buffers 251 are used to combine multiple stores until a complete cache line is filled and then the data is moved between cache levels. Thus, in the example shown in
The core-core producer-consumer logic 211a then writes a flag 225 to indicate that the data is ready for transfer. In one embodiment, the flag 225 is a single bit (e.g., with a ‘1’ indicating that the data is ready in the L3 cache). The core-core consumer-producer logic 211b of the consumer core 202 reads the flag 225 to determine that the data is ready, either through periodic polling by the core-core consumer-producer logic 211b or an interrupt. Once it learns that data is ready in the L3 cache (or other highest common cache level shared with the producer core 201), the consumer core 202 reads the data.
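The way the fill buffers combine stores of varying size into a complete cache line can be sketched as a toy software model. The class `FillBuffer` and its methods are hypothetical names introduced for illustration; the actual hardware structure is not specified at this level of detail in the text. The 64-byte line size is taken from the embodiment described above.

```python
CACHE_LINE_BYTES = 64  # line size of the particular embodiment in the text


class FillBuffer:
    """Toy model of one fill buffer that combines stores into a full cache line."""

    def __init__(self, line_size=CACHE_LINE_BYTES):
        self.line_size = line_size
        self.data = bytearray(line_size)
        self.valid = [False] * line_size  # tracks which bytes have been written

    def store(self, offset, payload):
        """Record a store of 1..line_size bytes at an offset within the line."""
        if not 1 <= len(payload) <= self.line_size:
            raise ValueError("store size must be between 1 byte and one cache line")
        if offset < 0 or offset + len(payload) > self.line_size:
            raise ValueError("store crosses the cache-line boundary")
        self.data[offset:offset + len(payload)] = payload
        for i in range(offset, offset + len(payload)):
            self.valid[i] = True

    def is_full(self):
        """True once every byte of the line has been written."""
        return all(self.valid)

    def evict(self):
        """Return the completed line and reset the buffer for the next line."""
        if not self.is_full():
            raise RuntimeError("line is not complete yet")
        line = bytes(self.data)
        self.data = bytearray(self.line_size)
        self.valid = [False] * self.line_size
        return line
```

In this model, four 16-byte stores at offsets 0, 16, 32, and 48 fill the buffer, after which `evict()` hands back one 64-byte line to be moved toward the common cache level.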
A method in accordance with one embodiment of the invention is illustrated in
At 301, the data to be exchanged is first stored to the fill buffers within the CPU. As mentioned, a chunk of data equal to a complete cache line may be stored within the fill buffers before initiating the data transfer between cache levels. Once the fill buffer is full (e.g., by an amount equal to a cache line) 302, an eviction cycle is generated at 303. The eviction cycle persists until the data is stored within a cache level common to both cores of the CPU, determined at 304. At 305, a flag is set by the producer core to indicate that the data is available for the consumer core and, at 306, the consumer core reads the data from the cache.
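The ordering of steps 301 through 306 can be walked through in a short software model. This is a minimal sketch, not the hardware mechanism: two threads stand in for the producer and consumer cores, a dictionary stands in for the common cache level, and a `threading.Event` stands in for the flag. The function name `run_exchange` and the 8-byte store granularity are assumptions made for illustration.

```python
import threading

CACHE_LINE_BYTES = 64  # line size of the particular embodiment in the text


def run_exchange(payload):
    """Walk steps 301-306 for one cache line and return what the consumer read."""
    if len(payload) != CACHE_LINE_BYTES:
        raise ValueError("this sketch transfers exactly one cache line")

    shared_cache = {}               # stands in for the common cache level
    data_ready = threading.Event()  # stands in for the flag
    received = []

    def producer():
        fill_buffer = bytearray()
        # 301: store the data to the fill buffer, 8 bytes at a time (assumed size)
        for i in range(0, len(payload), 8):
            fill_buffer += payload[i:i + 8]
        # 302: the buffer now holds a complete cache line
        assert len(fill_buffer) == CACHE_LINE_BYTES
        # 303-304: evict the line so it lands in the common cache level
        shared_cache["line"] = bytes(fill_buffer)
        # 305: set the flag only after the data is in place
        data_ready.set()

    def consumer():
        # The text allows polling or an interrupt; a blocking wait is used here.
        data_ready.wait()
        # 306: read the data from the common cache level
        received.append(shared_cache["line"])

    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received[0]
```

The point the sketch preserves is the ordering: the flag is raised only after the full line has reached the shared level, so the consumer never reads a partially written line.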
In one embodiment, the data is transferred to the fill buffers and then evicted to the L3 cache using a particular instruction, referred to herein as a MovNonAllocate (MovNA) instruction. As indicated in
Referring now to
Each processing element may be a single core or may, alternatively, include multiple cores. The processing elements may, optionally, include other on-die elements besides processing cores, such as integrated memory controller and/or integrated I/O control logic. Also, for at least one embodiment, the core(s) of the processing elements may be multithreaded in that they may include more than one hardware thread context per core.
The GMCH 420 may be a chipset, or a portion of a chipset. The GMCH 420 may communicate with the processor(s) 410, 415 and control interaction between the processor(s) 410, 415 and memory 440. The GMCH 420 may also act as an accelerated bus interface between the processor(s) 410, 415 and other elements of the system 400. For at least one embodiment, the GMCH 420 communicates with the processor(s) 410, 415 via a multi-drop bus, such as a frontside bus (FSB) 495.
Furthermore, GMCH 420 is coupled to a display 440 (such as a flat panel display). GMCH 420 may include an integrated graphics accelerator. GMCH 420 is further coupled to an input/output (I/O) controller hub (ICH) 450, which may be used to couple various peripheral devices to system 400. Shown for example in the embodiment of
Alternatively, additional or different processing elements may also be present in the system 400. For example, additional processing element(s) 415 may include additional processor(s) that are the same as processor 410, additional processor(s) that are heterogeneous or asymmetric to processor 410, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the physical resources 410, 415 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 410, 415. For at least one embodiment, the various processing elements 410, 415 may reside in the same die package.
According to one embodiment of the invention, the exemplary architecture of the data processing system 500 may be used for the mobile devices described above. The data processing system 500 includes the processing system 520, which may include one or more microprocessors and/or a system on an integrated circuit. The processing system 520 is coupled with a memory 510, a power supply 525 (which includes one or more batteries), an audio input/output 540, a display controller and display device 560, optional input/output 550, input device(s) 570, and wireless transceiver(s) 530. It will be appreciated that additional components, not shown in
The memory 510 may store data and/or programs for execution by the data processing system 500. The audio input/output 540 may include a microphone and/or a speaker to, for example, play music and/or provide telephony functionality through the speaker and microphone. The display controller and display device 560 may include a graphical user interface (GUI). The wireless (e.g., RF) transceivers 530 (e.g., a WiFi transceiver, an infrared transceiver, a Bluetooth transceiver, a wireless cellular telephony transceiver, etc.) may be used to communicate with other data processing systems. The one or more input devices 570 allow a user to provide input to the system. These input devices may be a keypad, keyboard, touch panel, multi touch panel, etc. The optional other input/output 550 may be a connector for a dock.
Other embodiments of the invention may be implemented on cellular phones and pagers (e.g., in which the software is embedded in a microchip), handheld computing devices (e.g., personal digital assistants, smartphones), and/or touch-tone telephones. It should be noted, however, that the underlying principles of the invention are not limited to any particular type of communication device or communication medium.
Embodiments of the invention may include various steps, which have been described above. The steps may be embodied in machine-executable instructions which may be used to cause a general-purpose or special-purpose processor to perform the steps. Alternatively, these steps may be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components.
Elements of the present invention may also be provided as a computer program product which may include a machine-readable medium having stored thereon instructions which may be used to program a computer (or other electronic device) to perform a process. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media, or other types of media/machine-readable media suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
Throughout this detailed description, for the purposes of explanation, numerous specific details were set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without some of these specific details. In certain instances, well-known structures and functions were not described in elaborate detail in order to avoid obscuring the subject matter of the present invention. Accordingly, the scope and spirit of the invention should be judged in terms of the claims which follow.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US11/66630 | 12/21/2011 | WO | 00 | 6/15/2013 |