Embodiments of the invention described herein relate generally to the field of processing logic, microprocessors, and associated instruction set architecture that, when executed by the processor or other processing logic, perform logical, mathematical, or other functional operations. In particular, the disclosure relates to efficient data movement operations between processor cores.
Modern multi-core processors typically include a multi-level cache hierarchy and coherence protocols that are intended to bring data closer to the processing core(s), with the aim of reducing latency by taking advantage of temporal and/or spatial locality in subsequent accesses. By storing data in caches, performance is improved when subsequent accesses reuse the same data as previous accesses (temporal locality) or reference nearby data (spatial locality). To keep data consistent between the cores, elaborate coherence protocols are often used. In some cases, however, these coherence protocols unintentionally cause data that is shared between a producer core and a consumer core to bounce back and forth between them, which reduces data locality and thereby decreases performance. For example, the coherence protocol may unnecessarily transfer ownership of the cache line containing the data from the producer to the consumer when, in fact, only the data needs to be transferred. By changing ownership of the cache line, its locality with respect to the producer is altered. This change not only degrades performance in subsequent accesses of the data by the producer, but also incurs additional penalties in power consumption, interconnect bandwidth, and cache space.
The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
Embodiments of a system, a method, and a processor for detecting repetitive data accesses made by read snapshot instructions and responsively loading the data into the local cache are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. For clarity, individual components in the Figures herein may be referred to by their labels in the Figures, rather than by a particular reference number.
Data movement services between producers and consumers (e.g., cores in a processor) move copies of data between them through read and write operations. For optimal performance, producers and consumers save copies of the data in their local caches based on locality considerations. As producers and consumers update and seek data, these locality considerations are altered. For example, in some coherent memory architectures, the act of reading or updating data changes the data's coherence state and position in the cache hierarchy. In other situations, the messaging system running on a single cache-coherent node (e.g., a processor) performs an unintended transfer of the producer's cache line ownership to the consumer when, in fact, only the data needed to be transferred. These changes in locality not only degrade performance, but also consume power, coherency bandwidth, and cache space.
To address performance degradations due to unnecessary and/or unintended data movements, a read snapshot instruction, or snapshot instruction for short, was realized. The snapshot instruction is a specialized data movement instruction that moves a copy of the requested data from the source to the destination without altering the cache state or the directory state anywhere within the cache-coherent domain. The data returned by the snapshot instruction is a snapshot of the content at the memory address specified by the snapshot instruction at the time the read is performed. When executed, the snapshot instruction allows data to be read or sourced in situ with respect to the caching hierarchy and coherency. For example, a consumer, by executing a snapshot instruction, can read data from a producer without causing a change in the coherence state and/or the location of the cache line containing the data. The performance benefits of the snapshot instruction are two-fold. First, because the snapshot instruction does not cache the data into the consumer's cache, it does not displace data whose absence from the consumer's cache would reduce performance (i.e., it reduces capacity misses). Second, because the snapshot instruction does not change the cache coherence state of the original cache line, a producer having write access to the original cache line will not lose that access and thus does not have to pay the cost of reacquiring access for future writes (i.e., the read-for-ownership cost).
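By way of illustration only, the following sketch shows how a consumer might use such an instruction through a compiler intrinsic. The intrinsic name read_snapshot_u64 is hypothetical, and the fallback definition simply performs a plain load so the sketch compiles; on hardware providing the read snapshot instruction, the intrinsic would instead map to that instruction.

```c
#include <stdint.h>

/* Hypothetical intrinsic: returns the 64-bit value at 'addr' without
 * allocating the line in the consumer's cache and without changing the
 * coherence state of the producer's copy. The fallback body below is a
 * plain load; real hardware would map this to the read snapshot
 * instruction described herein. */
static inline uint64_t read_snapshot_u64(const volatile uint64_t *addr)
{
    return *addr;
}

/* Producer-owned flag and payload; the producer keeps this line in M/E state. */
struct channel {
    volatile uint64_t seq;   /* incremented by the producer after each update */
    uint64_t payload;
};

/* The consumer polls the sequence number with snapshot reads, so the
 * producer's cache line is neither downgraded to Shared nor pulled into the
 * consumer's cache while the consumer waits. */
uint64_t consume_when_ready(struct channel *ch, uint64_t last_seen)
{
    while (read_snapshot_u64(&ch->seq) == last_seen)
        ;   /* spin until the producer publishes new data */
    return ch->payload;   /* read the payload once new data is observed */
}
```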
A typical producer/consumer data exchange using a read/load instruction is illustrated in
Referring to
In any case, a copy of cache line 132 is provided to processor 2 120 in operation 5, via a response to processor 2's read miss. This copy of cache line 132 is saved into the local cache 122 as cache line copy 136 and marked as (S)hared. To summarize, as a result of the read issued by processor 2 120, the cache coherence state of cache line 132 in the producer's local cache 112 is changed (e.g., M→S or E→S), and a copy 136 of the cache line is cached into the consumer's local cache 122.
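The coherence-state effect of this exchange can be summarized with a small model. The MESI-style states and transition rules below are a simplified illustration for this description, not the behavior of any particular protocol or product.

```c
#include <stdio.h>

/* Simplified MESI-style state of the producer's copy of the cache line. */
enum line_state { STATE_M, STATE_E, STATE_S, STATE_I };

/* Conventional remote read: the producer's copy is downgraded to Shared
 * (M->S or E->S) and the consumer allocates its own Shared copy. */
static enum line_state after_remote_read(enum line_state producer)
{
    (void)producer;
    return STATE_S;
}

/* Snapshot read: only the data moves; the producer's state is unchanged and
 * nothing is allocated in the consumer's cache. */
static enum line_state after_remote_snapshot(enum line_state producer)
{
    return producer;
}

int main(void)
{
    enum line_state producer = STATE_M;   /* producer owns a Modified line */
    printf("after conventional read: state %d (Shared)\n", after_remote_read(producer));
    printf("after snapshot read:     state %d (still Modified)\n", after_remote_snapshot(producer));
    return 0;
}
```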
In contrast,
When used appropriately, the snapshot instruction can significantly reduce the performance penalties incurred from unnecessary cache line bouncing between the producer and consumer caused by cache coherence protocols. However, in some scenarios, due to factors such as programming error, misunderstanding of usage, and/or inability to access an underlying library, the snapshot instruction is used repeatedly to read the same cache line. In these cases, the cumulative cost of the cross-core accesses, such as those illustrated in
In general, the snapshot instruction should be called only in situations where it provides performance benefits over a conventional read/load instruction. As mentioned above, when incorrectly used, repeated executions of the snapshot instruction to access the same data can negate most, if not all, of the benefits the snapshot instruction was intended to provide in the first place. As such, aspects of the present invention are directed to minimizing the occurrences of such performance inversion by proactively detecting repetitive accesses caused by incorrect use of the snapshot instruction and responsively storing the data into a local cache so that it can be used to satisfy future accesses. A subsequent snapshot instruction to read the same data would then be filled from the local cache. According to an embodiment, a requested cache line is allocated and stored into the local cache as a result of a snapshot instruction upon: 1) the detection of a repetitive access pattern to the same data, and/or 2) the occurrence of an event having a predetermined probability, such as the expiration of a timer or a randomly generated number matching a pre-selected number.
According to an embodiment, processor 460 also includes one or more predictors (e.g., predictor 470) to track the data access pattern of each node by monitoring the execution of snapshot instructions within each node. While each node, as illustrated by
The behavior of the snapshot instruction may be altered in several ways. A load operation may be enabled or performed in addition to, or instead of, the snapshot instruction. In one embodiment, the load operation copies the requested cache line into the local cache (e.g., L1 or L2 cache) of the requesting core. According to another embodiment, in addition to the requested cache line being copied into the local cache, it is marked as the least-recently used (LRU) cache line such that the requested cache line is more likely to be evicted from the local cache than other cache lines in the same cache. In another embodiment, the load operation causes the issuance of a prefetch instruction which specifies the cache level into which the requested cache line is to be stored.
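A software analogue of the prefetch-based variant can be sketched with the GCC/Clang builtin __builtin_prefetch, whose temporal-locality hint loosely selects how close to the core the line is kept. The mapping from hint values to cache levels shown below is an assumption made only for illustration; the actual mapping is implementation-defined.

```c
/* Sketch of the prefetch-based load variant. __builtin_prefetch(addr, rw,
 * locality) takes a read/write hint (0 = read) and a locality hint (0..3);
 * higher locality values ask for the line to be kept closer to the core. */

enum target_level { LEVEL_L1, LEVEL_L2, LEVEL_LLC };

static void prefetch_into_level(const void *addr, enum target_level level)
{
    switch (level) {
    case LEVEL_L1:
        __builtin_prefetch(addr, 0, 3);   /* keep close: high temporal locality */
        break;
    case LEVEL_L2:
        __builtin_prefetch(addr, 0, 2);
        break;
    case LEVEL_LLC:
        __builtin_prefetch(addr, 0, 1);   /* keep far: low temporal locality */
        break;
    }
}
```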
One of the difficulties in implementing a table to track repetitive accesses to cache lines is allocating sufficient resources for the table to capture enough accesses for determining a pattern. Generally, smaller tables, while being faster and cheaper to implement, are less effective than large ones. Sparse yet repetitive accesses may not be easily detected when a small table is used. For example, a program may cycle repetitively through a large number of memory addresses such that the time and/or number of other accesses between two repetitive accesses can be very large. Consequently, a small table may not be able to track enough accesses to determine a repetitive pattern, because a previous access may have already been evicted from the table before the next access to the same location is recorded. For this reason, a randomized “competitive” behavior is used in addition to, or instead of, the table approach. According to an embodiment, upon detecting the issuance of a snapshot instruction, the predictor may make a random choice on whether to modify the behavior of the snapshot instruction. In one embodiment, the behavior of the snapshot instruction is modified by enabling and/or performing a load operation to bring the data into the processor's local cache.
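A minimal sketch of such a randomized choice is shown below; the sampling period and the use of rand() are arbitrary placeholders standing in for whatever cheap pseudo-random source or timer the predictor hardware would use.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Randomized "competitive" policy: independently of any table state, modify
 * roughly one snapshot access in SNAPSHOT_LOAD_PERIOD by also performing a
 * load. The period is an arbitrary illustrative value. */
#define SNAPSHOT_LOAD_PERIOD 256

static bool randomly_enable_load(void)
{
    return (rand() % SNAPSHOT_LOAD_PERIOD) == 0;
}
```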
In response to the detection, at block 803, a total instructions counter (totals counter) for tracking the total number of detected snapshot instructions is incremented. At block 804, a determination is made on whether the requested cache line is already being tracked by the predictor. This may be performed by a table lookup of the memory address of the cache line. If the cache line is not currently being tracked, an entry corresponding to the cache line is created in the table at block 806 for tracking accesses to that cache line. Each of the entries in the table may include an address field to store the memory address of the cache line being tracked, and a counter to track the number of accesses to the cache line made via a snapshot instruction issued by the requesting processor. To minimize the cost of implementation, the table may be implemented as a hash table in which the memory address of each tracked cache line is hashed to determine the corresponding entry for that cache line. When a hash table is used, multiple cache lines may be tracked by a single hash table entry.
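The table organization can be sketched as a small, direct-mapped hash table indexed by cache-line address. The table size, line size, and hash function below are illustrative assumptions; a tagless variant would simply omit the address check so that aliasing cache lines share one entry's count.

```c
#include <stdint.h>

#define TABLE_ENTRIES 64     /* illustrative table size */
#define LINE_SHIFT    6      /* 64-byte cache lines */

struct predictor_entry {
    uint64_t line_addr;      /* address field of the tracked cache line */
    uint32_t access_count;   /* snapshot accesses observed for this line */
    uint8_t  valid;
};

struct predictor {
    struct predictor_entry table[TABLE_ENTRIES];
    uint64_t total_snapshots;   /* totals counter (block 803) */
};

static unsigned hash_line(uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;
    return (unsigned)((line ^ (line >> 16)) % TABLE_ENTRIES);
}

/* Blocks 804/806: find the entry tracking 'addr', or allocate one for it. */
static struct predictor_entry *lookup_or_allocate(struct predictor *p, uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;
    struct predictor_entry *e = &p->table[hash_line(addr)];
    if (!e->valid || e->line_addr != line) {   /* not currently tracked */
        e->valid = 1;
        e->line_addr = line;
        e->access_count = 0;
    }
    return e;
}
```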
At block 808, the counter in the entry corresponding to the cache line is incremented to reflect the access. At block 810, a determination is made on whether the counter, as a result of the increment, exceeds a max threshold. If the max threshold is exceeded by the counter, a load operation is enabled at block 812. According to an embodiment, enabling the load operation means performing the load operation by storing the requested cache line into a local cache of the requesting processor and updating the cache coherence state of any instances of the cache line in the cache coherent domain to (S)hared. The load operation may be performed in addition to, or instead of, the snapshot instruction. If it is determined at block 810 that the max threshold has not been exceeded by the counter as a result of the increment, another determination is made, at block 814, on whether the totals counter has exceeded a totals threshold. If the totals threshold has been exceeded, the totals counter is reset at block 816 and the load operation is enabled at block 812. If the totals threshold has not been exceeded, the method ends.
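Continuing the table sketch above, the flow of blocks 802 through 816 might look like the following; the threshold values are placeholders chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_THRESHOLD    8      /* repetitive-access trigger (block 810) */
#define TOTALS_THRESHOLD 4096   /* fallback trigger on total snapshots (block 814) */

/* Called on every detected snapshot instruction (block 802). Returns true if
 * the load operation should be enabled for this access (block 812). */
static bool on_snapshot(struct predictor *p, uint64_t addr)
{
    struct predictor_entry *e;

    p->total_snapshots++;                        /* block 803 */
    e = lookup_or_allocate(p, addr);             /* blocks 804/806 */
    e->access_count++;                           /* block 808 */

    if (e->access_count > MAX_THRESHOLD)         /* block 810 */
        return true;                             /* block 812: enable load */

    if (p->total_snapshots > TOTALS_THRESHOLD) { /* block 814 */
        p->total_snapshots = 0;                  /* block 816 */
        return true;                             /* block 812 */
    }
    return false;                                /* snapshot behavior unchanged */
}
```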
An example of the present invention is a system that includes a main memory and a first processor that is communicatively coupled to the main memory and includes a first cache to store a cache line. This cache line may be associated with a cache coherence state indicating that the first cache has sole ownership of the cache line. The system further includes a second processor that is also communicatively coupled to the main memory and includes a second cache. The second processor may further include a decoder configured to decode a snapshot instruction that specifies the memory address of the cache line stored in the first cache, and an execution unit configured to execute the decoded snapshot instruction and to read data from the cache line stored in the first cache without changing the cache coherence state associated with the cache line or its location, such that the cache line is to remain in the first cache after the read. Additionally, the system includes predictor circuitry configured to track accesses to the cache line by monitoring executions of the snapshot instruction by the execution unit of the second processor. The predictor circuitry may further be configured to control enablement of a load operation based on the tracked accesses, such that an enablement of the load operation is to cause 1) a copy of the cache line to be stored into the second cache and 2) the cache coherence state of the cache line in the first cache to be changed to shared. The system may include a table to track accesses to the cache line, such that a table entry corresponding to the cache line is operable to store a count of a number of accesses to the cache line made by the second processor via its execution of the snapshot instruction. The count may be incremented each time the cache line is accessed by the second processor via an execution of the snapshot instruction and may be decremented each time a predetermined amount of time has passed. The predictor circuitry may be configured to enable the load operation when the count exceeds a first threshold or when a certain number of snapshot instructions have been executed by the second processor. The predictor may also be configured to delete the table entry corresponding to the cache line when the count in the table entry falls below a second threshold. The table may be a hash table so that the memory address of the cache line is hashed in order to determine the corresponding table entry for the cache line. The enablement of the load operation may cause a copy of the cache line to be stored into the second cache as a least recently used (LRU) cache line so that it is more likely to be evicted from the second cache than other stored cache lines. The enablement of the load operation may also cause issuance of a prefetch instruction which specifies the cache level into which a copy of the cache line is to be stored. The execution of that prefetch instruction may cause a copy of the cache line to be stored into the specified cache level.
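Continuing the earlier predictor sketch, the time-based decrement and entry deletion described in this example might be implemented as a periodic pass over the table; the decay amount, interval, and second threshold below are illustrative assumptions.

```c
#define DECAY_AMOUNT  1   /* decrement applied once per predetermined interval */
#define MIN_THRESHOLD 1   /* second threshold: delete the entry below this count */

/* Invoked once per predetermined time interval: age all counts and delete
 * entries whose count has fallen below the second threshold. */
static void predictor_decay(struct predictor *p)
{
    for (unsigned i = 0; i < TABLE_ENTRIES; i++) {
        struct predictor_entry *e = &p->table[i];
        if (!e->valid)
            continue;
        if (e->access_count > DECAY_AMOUNT)
            e->access_count -= DECAY_AMOUNT;
        else
            e->access_count = 0;
        if (e->access_count < MIN_THRESHOLD)
            e->valid = 0;   /* the cache line is no longer tracked */
    }
}
```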
Another example of the present invention is a method that includes storing a cache line in a first cache of a first processor. The cache line may be associated with a cache coherence state indicating that the first cache has sole ownership of the cache line. The method also includes tracking accesses to the cache line by monitoring executions of a snapshot instruction made by a second processor. The snapshot instruction may specify a memory address of the cache line and an execution of the snapshot instruction may cause the second processor to read data from the cache line stored in the first cache without changing the cache coherence state associated with the cache line, such that the cache line is to remain in the first cache after the read. The method further includes controlling enablement of a load operation based on the tracked accesses, such that an enablement of the load operation is to cause 1) a copy of the cache line to be stored into a second cache of the second processor, and 2) the cache coherence state of the cache line in the first cache to be changed to shared. In addition, the method may include using a table to track accesses to the cache line, such that a table entry corresponding to the cache line is to store a count of a number of accesses to the cache line made by the second processor via its executions of the snapshot instruction. The method may also include incrementing the count each time the cache line is accessed by the second processor via its execution of the snapshot instruction, and enabling the load operation when the count exceeds a first threshold and/or when a certain number of snapshot instructions have been executed by the second processor. Additionally, the method may include decrementing the count each time a predetermined amount of time has passed and deleting the table entry corresponding to the cache line when the count falls below a second threshold. Furthermore, the method may include hashing the memory address of the cache line to determine the corresponding table entry for the cache line. The method also may include storing the copy of the cache line into the second cache as least recently used (LRU) so that the copy of the cache line is more likely to be evicted from the second cache than other cache lines in the second cache. The method may include issuing a prefetch instruction which specifies a cache level into which the copy of the cache line is to be stored, and executing that prefetch instruction to store the copy of the cache line into the specified cache level.
Yet another example of the present invention is an apparatus, such as a processor, that includes a first processor core, a second processor core, and predictor circuitry. The first processor core includes a first cache to store a cache line which is associated with a cache coherence state indicating that the first cache has sole ownership of the cache line. The second processor core includes a second cache, a decoder configured to decode a snapshot instruction which specifies a memory address of the cache line, and an execution unit configured to execute the decoded snapshot instruction and to read data from the cache line stored in the first cache, without changing the cache coherence state associated with the cache line or its location, such that the cache line is to remain in the first cache after the read. The predictor circuitry may be configured to track accesses to the cache line by monitoring executions of the snapshot instruction by the execution unit of the second processor. The predictor circuitry may also be configured to control enablement of a load operation based on the tracked accesses, such that an enablement of the load operation is to cause 1) a copy of the cache line to be stored into the second cache and 2) the cache coherence state of the cache line in the first cache to be changed to shared. The apparatus may include a table to track accesses to the cache line, such that a table entry corresponding to the cache line is operable to store a count of a number of accesses to the cache line made by the second processor via its execution of the snapshot instruction. The count may be incremented each time the cache line is accessed by the second processor via an execution of the snapshot instruction and may be decremented each time a predetermined amount of time has passed. The predictor circuitry may be configured to enable the load operation when the count exceeds a first threshold or when a certain number of snapshot instructions have been executed by the second processor. The predictor may also be configured to delete the table entry corresponding to the cache line when the count in the table entry falls below a second threshold. The table may be a hash table so that the memory address of the cache line is hashed in order to determine the corresponding table entry for the cache line. The enablement of the load operation may cause a copy of the cache line to be stored into the second cache as a least recently used (LRU) cache line so that it is more likely to be evicted from the second cache than other stored cache lines. The enablement of the load operation may also cause issuance of a prefetch instruction which specifies the cache level into which a copy of the cache line is to be stored. The execution of that prefetch instruction may cause a copy of the cache line to be stored into the specified cache level.
In
The front end hardware 1030 includes a branch prediction hardware 1032 coupled to an instruction cache hardware 1034, which is coupled to an instruction translation lookaside buffer (TLB) 1036, which is coupled to an instruction fetch hardware 1038, which is coupled to a decode hardware 1040. The decode hardware 1040 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode hardware 1040 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 1090 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode hardware 1040 or otherwise within the front end hardware 1030). The decode hardware 1040 is coupled to a rename/allocator hardware 1052 in the execution engine hardware 1050.
The execution engine hardware 1050 includes the rename/allocator hardware 1052 coupled to a retirement hardware 1054 and a set of one or more scheduler hardware 1056. The scheduler hardware 1056 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler hardware 1056 is coupled to the physical register file(s) hardware 1058. Each of the physical register file(s) hardware 1058 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) hardware 1058 comprises a vector registers hardware, a write mask registers hardware, and a scalar registers hardware. This register hardware may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) hardware 1058 is overlapped by the retirement hardware 1054 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement hardware 1054 and the physical register file(s) hardware 1058 are coupled to the execution cluster(s) 1060. The execution cluster(s) 1060 includes a set of one or more execution hardware 1062 and a set of one or more memory access hardware 1064. The execution hardware 1062 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution hardware dedicated to specific functions or sets of functions, other embodiments may include only one execution hardware or multiple execution hardware that all perform all functions. The scheduler hardware 1056, physical register file(s) hardware 1058, and execution cluster(s) 1060 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler hardware, physical register file(s) hardware, and/or execution cluster—and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access hardware 1064). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set of memory access hardware 1064 is coupled to the memory hardware 1070, which includes a data TLB hardware 1072 coupled to a data cache hardware 1074 coupled to a level 2 (L2) cache hardware 1076. In one exemplary embodiment, the memory access hardware 1064 may include a load hardware, a store address hardware, and a store data hardware, each of which is coupled to the data TLB hardware 1072 in the memory hardware 1070. The instruction cache hardware 1034 is further coupled to a level 2 (L2) cache hardware 1076 in the memory hardware 1070. The L2 cache hardware 1076 is coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1000 as follows: 1) the instruction fetch hardware 1038 performs the fetch and length decoding stages 1002 and 1004; 2) the decode hardware 1040 performs the decode stage 1006; 3) the rename/allocator hardware 1052 performs the allocation stage 1008 and renaming stage 1010; 4) the scheduler hardware 1056 performs the schedule stage 1012; 5) the physical register file(s) hardware 1058 and the memory hardware 1070 perform the register read/memory read stage 1014; the execution cluster 1060 performs the execute stage 1016; 6) the memory hardware 1070 and the physical register file(s) hardware 1058 perform the write back/memory write stage 1018; 7) various hardware may be involved in the exception handling stage 1022; and 8) the retirement hardware 1054 and the physical register file(s) hardware 1058 perform the commit stage 1024.
The core 1090 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 1090 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2, and/or some form of the generic vector friendly instruction format (U=0 and/or U=1), described below), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).
While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache hardware 1034/1074 and a shared L2 cache hardware 1076, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.
Thus, different implementations of the processor 1100 may include: 1) a CPU with the special purpose logic 1108 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1102A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 1102A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor with the cores 1102A-N being a large number of general purpose in-order cores. Thus, the processor 1100 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1100 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache hardware 1106, and external memory (not shown) coupled to the set of integrated memory controller hardware 1114. The set of shared cache hardware 1106 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect hardware 1112 interconnects the integrated graphics logic 1108, the set of shared cache hardware 1106, and the system agent hardware 1110/integrated memory controller hardware 1114, alternative embodiments may use any number of well-known techniques for interconnecting such hardware. In one embodiment, coherency is maintained between one or more cache hardware 1106 and cores 1102A-N.
In some embodiments, one or more of the cores 1102A-N are capable of multithreading. The system agent 1110 includes those components coordinating and operating cores 1102A-N. The system agent hardware 1110 may include for example a power control unit (PCU) and a display hardware. The PCU may be or include logic and components needed for regulating the power state of the cores 1102A-N and the integrated graphics logic 1108. The display hardware is for driving one or more externally connected displays.
The cores 1102A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1102A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set. In one embodiment, the cores 1102A-N are heterogeneous and include both the “small” cores and “big” cores described below.
Referring now to
The optional nature of additional processors 1215 is denoted in
The memory 1240 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1220 communicates with the processor(s) 1210, 1215 via a multi-drop bus, such as a frontside bus (FSB), point-to-point interface, or similar connection 1295.
In one embodiment, the coprocessor 1245 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 1220 may include an integrated graphics accelerator.
There can be a variety of differences between the physical resources 1210, 1215 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.
In one embodiment, the processor 1210 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1210 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1245. Accordingly, the processor 1210 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 1245. Coprocessor(s) 1245 accept and execute the received coprocessor instructions.
Referring now to
Processors 1370 and 1380 are shown including integrated memory controller (IMC) hardware 1372 and 1382, respectively. Processor 1370 also includes, as part of its bus controller hardware, point-to-point (P-P) interfaces 1376 and 1378; similarly, second processor 1380 includes P-P interfaces 1386 and 1388. Processors 1370, 1380 may exchange information via a point-to-point (P-P) interface 1350 using P-P interface circuits 1378, 1388. As shown in
Processors 1370, 1380 may each exchange information with a chipset 1390 via individual P-P interfaces 1352, 1354 using point to point interface circuits 1376, 1394, 1386, 1398. Chipset 1390 may optionally exchange information with the coprocessor 1338 via a high-performance interface 1339. In one embodiment, the coprocessor 1338 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.
A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Chipset 1390 may be coupled to a first bus 1316 via an interface 1396. In one embodiment, first bus 1316 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.
As shown in
Referring now to
Referring now to
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as code 1330 illustrated in
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
Although some embodiments have been described in reference to particular implementations, other implementations are possible according to some embodiments. Additionally, the arrangement and/or order of elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some embodiments.
In each system shown in a figure, the elements in some cases may each have a same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.
In the description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
An embodiment is an implementation or example of the inventions. Reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the inventions. The various appearances “an embodiment,” “one embodiment,” or “some embodiments” are not necessarily all referring to the same embodiments.
Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular embodiment or embodiments. If the specification states a component, feature, structure, or characteristic “may”, “might”, “can” or “could” be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.