Embodiments of the invention described herein relate generally to efficient memory access in a computer processing system. In particular, the disclosure relates to an architecture extension for performing low-latency address translations used in direct memory accesses made by hardware subsystems.
In computing, accelerators are specialized computing devices designed to perform certain functions more efficiently than is possible by software running on a general-purpose central processing unit (CPU). For example, visualization processes may be offloaded from the CPU onto a graphics card to enable faster, higher-quality playback of videos and games. Similarly, compression and decompression workloads that are computationally intensive may be better suited to specialized encoders and decoders than to a CPU. Efficient use of accelerators can decrease latency, increase throughput, and reduce CPU utilization.
The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
Embodiments of apparatus and method for reducing memory access latency by hardware subsystems are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. For clarity, individual components in the Figures herein may be referred to by their labels in the Figures, rather than by a particular reference number.
In computing, hardware subsystems such as I/O devices, accelerators, graphics cards, and encoders/decoders are specialized computing devices designed to perform certain functions more efficiently than is possible by software alone on a general-purpose central processing unit (CPU). The CPU and software submit jobs to hardware subsystems via special instructions. For example, the enqueue command instruction, part of the Intel® instruction set architecture, enables user or kernel space software applications to submit jobs to hardware subsystems via abstracted job descriptors (descriptors). The use of job descriptors hides hardware semantics from software applications which, in turn, helps simplify the job submission process.
To submit a job, a software application first constructs a standardized job descriptor in memory. The information in the job descriptor may include, for example, the job (i.e. command or workload) to be performed, the process address space identifier (PASID) of the software application or thread, the privilege level (e.g., user or supervisor), and the data required by the hardware subsystem to perform the job. Once the job descriptor is constructed, the application submits the job descriptor to the hardware subsystem by invoking or calling an enqueue command instruction. The enqueue command instruction may include one or more operands for specifying information such as the location of the job descriptor in memory and the target hardware subsystem to perform the job (e.g., an identifier of the hardware subsystem or its job queue). In one embodiment, a memory-mapped I/O (MMIO) address is used to identify the target hardware subsystem or the job queue.
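For illustration, the following C sketch models a hypothetical job descriptor and its submission through an enqueue-style helper; the field names, the enqueue_cmd() function, and the portal address are assumptions made for this example and do not reflect any particular descriptor format or device interface.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of a job descriptor; real descriptor formats are
 * device- and ISA-specific. */
struct job_descriptor {
    uint32_t opcode;     /* command or workload to be performed  */
    uint32_t pasid;      /* process address space identifier     */
    uint8_t  privilege;  /* 0 = user, 1 = supervisor             */
    uint64_t src_addr;   /* virtual address of the input data    */
    uint64_t dst_addr;   /* virtual address of the output buffer */
    uint64_t size;       /* number of bytes to process           */
};

/* Stand-in for the enqueue command instruction: in hardware this is a
 * single instruction taking the descriptor location and an MMIO portal
 * that identifies the target hardware subsystem or job queue. */
static int enqueue_cmd(uint64_t mmio_portal, const struct job_descriptor *desc)
{
    printf("submit descriptor (opcode %u, PASID %u) to portal 0x%llx\n",
           (unsigned)desc->opcode, (unsigned)desc->pasid,
           (unsigned long long)mmio_portal);
    return 0; /* 0 = accepted; a real portal could also report "queue full" */
}

int main(void)
{
    struct job_descriptor desc = {
        .opcode = 0x1, .pasid = 42, .privilege = 0,
        .src_addr = 0x7f0000001000ull, .dst_addr = 0x7f0000002000ull,
        .size = 4096,
    };
    /* Placeholder portal address, for illustration only. */
    return enqueue_cmd(0xfed00000ull, &desc);
}
```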
When the enqueue command instruction is executed by the CPU, the job descriptor is copied from memory to the job queue of the target hardware subsystem. The queue may be a register or a local cache associated with the target hardware subsystem. The jobs in the job queue are then processed by the hardware subsystem. As part of performing the job, the hardware subsystem requests data from system memory, usually through the input-output memory management unit (IOMMU).
An extended IOMMU with PASID support offers a Shared Virtual Memory (SVM) function which allows hardware subsystems to access memory by direct memory access (DMA) using virtual addresses. The use of SVM allows software applications to submit jobs to hardware subsystems without having to convert virtual addresses into physical addresses before submission. The overhead associated with address translation, however, is passed onto the IOMMU during DMA.
In host mode, the degradation in DMA performance can usually be attributed to the need to translate I/O Virtual Addresses (IOVA) and/or Host Virtual Addresses into physical addresses. In either case, costly page table walks are often required. In virtual machine environments where Guest Virtual Addresses (GVA) are used, the degradation in DMA performance in SVM mode is even more severe due to nested translations. In some cases, the translation task can consume up to 80 CPU cycles per DMA request.
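The cost of nested translation can be illustrated with a rough access-count model; the page-table depths below are assumptions for illustration rather than a description of any specific IOMMU.

```c
#include <stdio.h>

/* Rough, illustrative cost model for nested address translation on an
 * IOTLB miss: every level of the guest page-table walk produces a guest
 * physical address that must itself be translated through the host page
 * tables. The level counts are assumptions for illustration only. */
int main(void)
{
    const int guest_levels = 4;   /* guest page-table levels (assumed) */
    const int host_levels  = 4;   /* host page-table levels (assumed)  */

    /* Single-level (host-only) walk: one memory access per level. */
    int host_only = host_levels;

    /* Nested walk: each guest level plus the final guest physical address
     * must be resolved through the host tables before it can be read. */
    int nested = (guest_levels + 1) * (host_levels + 1) - 1;

    printf("host-only walk: %d memory accesses\n", host_only);
    printf("nested walk:    %d memory accesses\n", nested);
    return 0;
}
```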
Aspects of the present disclosure help reduce the latency associated with address translation in the IOMMU. According to embodiments of the present invention, the serialized datapath pipeline is restructured into two parallel pipelines—a modified datapath pipeline and a separate translation (pre-translation) pipeline. This allows the IOMMU to pre-translate and warm up the translation cache prior to receiving DMA requests from the hardware subsystem. In doing so, the impact of address translation on latency is minimized which, in turn, improves the overall system performance.
Returning to the datapath pipeline, at block 206, as the hardware subsystem processes the job descriptor from its job queue, it identifies the data required for performing the job and responsively generates one or more DMA requests using the virtual addresses of the data. At block 208, the IOMMU receives and processes the DMA requests using the address translations that are already in the IOTLB to obtain the corresponding physical address for each virtual address in the DMA request. Next, at block 210, the IOMMU accesses the memory using the physical addresses and provides the retrieved data to the hardware subsystem to perform the job. Since address translations typically take much less time than job submissions to the hardware subsystem, due to the latency of the PCI MMIO/bus, it is fair to assume that when the DMA requests from the hardware subsystem reach the IOMMU, the relevant address translations are already present in the IOTLB. This allows the IOMMU to access memory without the page table walk latency associated with address translation.
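A minimal sketch of the IOMMU side of this datapath handling, assuming a toy direct-mapped IOTLB with hypothetical entry fields, might look like the following; a real IOTLB organization and the fallback page walk are considerably more involved.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy IOTLB: a direct-mapped array of (PASID, virtual page) -> physical
 * page entries. The size and indexing scheme are illustrative assumptions. */
#define IOTLB_ENTRIES 64
#define PAGE_SHIFT    12

struct iotlb_entry { bool valid; uint32_t pasid; uint64_t vpn, pfn; };
static struct iotlb_entry iotlb[IOTLB_ENTRIES];

static bool iotlb_lookup(uint32_t pasid, uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    struct iotlb_entry *e = &iotlb[vpn % IOTLB_ENTRIES];
    if (e->valid && e->pasid == pasid && e->vpn == vpn) {
        *pa = (e->pfn << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1));
        return true;                  /* hit: no page walk needed */
    }
    return false;                     /* miss: page walk required */
}

/* Datapath handling of one DMA request (blocks 208/210 above): the
 * translation is expected to be in the IOTLB already because the
 * pre-translation pipeline warmed it up. */
static void handle_dma(uint32_t pasid, uint64_t va)
{
    uint64_t pa;
    if (iotlb_lookup(pasid, va, &pa))
        printf("DMA: VA 0x%llx -> PA 0x%llx (IOTLB hit)\n",
               (unsigned long long)va, (unsigned long long)pa);
    else
        printf("DMA: VA 0x%llx missed the IOTLB, falling back to a page walk\n",
               (unsigned long long)va);
}

int main(void)
{
    /* Pretend the pre-translation pipeline already inserted this entry. */
    iotlb[(0x7f0000001000ull >> PAGE_SHIFT) % IOTLB_ENTRIES] =
        (struct iotlb_entry){ .valid = true, .pasid = 42,
                              .vpn = 0x7f0000001000ull >> PAGE_SHIFT,
                              .pfn = 0x12345 };
    handle_dma(42, 0x7f0000001234ull);    /* hit  */
    handle_dma(42, 0x7f0000009000ull);    /* miss */
    return 0;
}
```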
Benefits provided by aspects of the present invention include eliminating the delay of IOMMU address translation from the DMA operation by performing address translations in parallel with the job submission to the hardware subsystem. This is especially useful in virtualized environments where address translations are often nested and involve multiple page tables. Aspects of the present invention also increase the performance of various hardware subsystems/devices (saving up to ~80 cycles per DMA request), including graphics accelerators, Ethernet accelerators, crypto accelerators, and data accelerators. Features of the present invention may be configurable by hypervisor software via an IOMMU interface.
In operation, a software application or thread submits a job by calling an enqueue command instruction which specifies a job descriptor 322 stored in the system memory 320. The enqueue engine 312 in the CPU, in response to the execution of the enqueue command instruction, initiates a pre-translation pipeline along with a datapath pipeline. The pre-translation pipeline begins with the CPU 310 invoking the pre/parallel translation interface 342 provided by the IOMMU 340 to submit a pre-translation request. The interface provided by the IOMMU may be implemented as a register set or a hidden channel (i.e. side channel). Information provided in the pre-translation request may include the bus/device/function (BDF) identifier of the hardware subsystem, the PASID of the software application/thread, and/or one or more virtual addresses to be translated. The BDF and PASID may be used by the page table walk engine 346 to identify the page table from which address translations are obtained. The IOMMU 340, upon receiving the pre-translation request from the pre-translation pipeline, begins translating the virtual addresses. If a translation is not available locally (i.e. missing in the IOTLB 344), the page table walk engine 346 searches one or more page tables to find the address translation. For example, depending on the translation mode configured for the hardware subsystem and/or the PASID, the page table walk engine may access the translation table of the hardware subsystem (i.e. second level page table) and/or the host page table (i.e. first level page table) to retrieve the desired physical address translation. Upon successful completion of the page table walk, the page table walk engine 346 inserts the virtual-to-physical translation into the IOTLB 344.
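A possible shape of the pre-translation request and the cache-warming loop is sketched below in C; the structure layout, the stub page_walk() and iotlb_insert() helpers, and the BDF/PASID values are illustrative assumptions, not the actual IOMMU interface.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical shape of a pre-translation request submitted to the IOMMU
 * over a register set or side channel; field names are illustrative. */
struct pretranslate_req {
    uint16_t bdf;            /* bus/device/function of the hardware subsystem */
    uint32_t pasid;          /* PASID of the submitting application/thread    */
    int      nr_addrs;       /* number of virtual addresses to pre-translate  */
    uint64_t vaddrs[8];      /* virtual addresses taken from the descriptor   */
};

/* Stand-ins for the IOTLB and the page-table walk engine. A real walk
 * would select first-level and/or second-level tables based on the
 * translation mode configured for the BDF/PASID. */
static int iotlb_contains(uint32_t pasid, uint64_t va) { (void)pasid; (void)va; return 0; }

static uint64_t page_walk(uint16_t bdf, uint32_t pasid, uint64_t va)
{
    (void)bdf; (void)pasid;
    return va ^ 0xfff000000000ull;   /* fake physical address for the demo */
}

static void iotlb_insert(uint32_t pasid, uint64_t va, uint64_t pa)
{
    printf("IOTLB warm: PASID %u VA 0x%llx -> PA 0x%llx\n",
           (unsigned)pasid, (unsigned long long)va, (unsigned long long)pa);
}

/* Pre-translation pipeline: translate every address in the request before
 * any DMA request from the hardware subsystem arrives. */
static void iommu_pretranslate(const struct pretranslate_req *req)
{
    for (int i = 0; i < req->nr_addrs; i++) {
        uint64_t va = req->vaddrs[i];
        if (!iotlb_contains(req->pasid, va))
            iotlb_insert(req->pasid, va, page_walk(req->bdf, req->pasid, va));
    }
}

int main(void)
{
    struct pretranslate_req req = {
        .bdf = 0x0300, .pasid = 42, .nr_addrs = 2,
        .vaddrs = { 0x7f0000001000ull, 0x7f0000002000ull },
    };
    iommu_pretranslate(&req);
    return 0;
}
```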
Concurrently with the translation pipeline, the datapath pipeline begins with the enqueue engine 312 storing the job descriptor into the job queue 332 of the hardware subsystem 330. From the job queue 332, jobs are dispatched to the hardware interface 334 to be processed by the processor 336. During the processing, one or more DMA requests containing host virtual addresses or I/O virtual addresses (IOVA) are submitted to the IOMMU or root complex to access data. Since the translations for the virtual addresses or IOVAs are likely available in IOTLB already from the pre-translation pipeline, the DMA remapping engine can quickly perform memory operations using the cached address translations without costly page table walks. It is reasonable to assume that the pre-translation pipeline will complete before the datapath pipeline because memory transactions performed in the pre-translation pipeline are inherently faster than the PCI MMIO transactions in the datapath pipeline.
The instruction fetch unit 410 may include various well known components including a next instruction pointer 403 for storing the address of the next instruction to be fetched from memory 400 (or one of the caches); an instruction translation look-aside buffer (ITLB) 404 for storing a map of recently used virtual-to-physical instruction addresses to improve the speed of address translation; a branch prediction unit 402 for speculatively predicting instruction branch addresses; and branch target buffers (BTBs) 401 for storing branch addresses and target addresses. Once fetched, instructions are streamed to the remaining stages of the instruction pipeline including the decode unit 430, the execution unit 440, and the writeback unit 450. The structure and function of each of these units is well understood by those of ordinary skill in the art and will not be described here in detail to avoid obscuring the pertinent aspects of the different embodiments of the invention.
In one embodiment, the decode unit 430 includes an enqueue command instruction decoder 431 for decoding the enqueue command instructions described herein (e.g., into sequences of micro-operations in one embodiment) and the execution unit 440 includes an enqueue command instruction execution unit 441 for executing the decoded enqueue command instructions.
The request buffer 530 pointed to by pointer 522 may store the command and parameters 532 specifying the action(s) to be taken by the target hardware subsystem. In addition, the request buffer 530 may store pointers to scattered payloads that need to be processed by the hardware subsystem. For example, request buffer 530 may store pointers 534 and 538, which are the memory addresses at which payloads 536 and 540 are stored, respectively. The CPU, as part of the translation pipeline, parses the job descriptor 500 and locates all of the addresses that require translation. For example, the CPU may determine the following from the job descriptor:
According to an embodiment, the virtual addresses of the request buffer descriptor 522 and the response buffer descriptor 524 are stored in the job descriptor 500 and are thus referenced directly in the job descriptor 500. On the other hand, the virtual addresses of the payloads (pointers 534 and 538) are stored in a secondary descriptor (request buffer 530) and are thus indirectly referenced in the job descriptor 500. According to an embodiment, all of these virtual addresses will be retrieved by the CPU and provided to the IOMMU via a sideband channel to be pre-translated. In order to parse the job descriptor for different types of hardware subsystems and devices, a standardized format may be defined for the job descriptor to be used with the enqueue command instruction. The standardized format may enable the CPU to identify both directly-referenced and indirectly-referenced virtual addresses more efficiently and accurately.
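Assuming such a standardized format, the parsing step might be sketched as follows; the descriptor and request-buffer layouts and the collect_vas() helper are hypothetical and serve only to show how directly- and indirectly-referenced addresses could be gathered for pre-translation.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed standardized layout: the request/response buffer addresses are
 * referenced directly in the descriptor, while the request buffer in turn
 * holds pointers to scattered payloads (indirect references). */
struct request_buffer {
    uint64_t command;
    uint64_t payload_addrs[2];           /* indirectly referenced payloads */
};

struct job_descriptor {
    uint32_t pasid;
    uint64_t request_buf_addr;           /* directly referenced            */
    uint64_t response_buf_addr;          /* directly referenced            */
};

/* Collect every virtual address that should be pre-translated: the direct
 * references from the descriptor plus the indirect references found by
 * following the request buffer. */
static int collect_vas(const struct job_descriptor *jd,
                       const struct request_buffer *rb,
                       uint64_t *out, int max)
{
    int n = 0;
    if (n < max) out[n++] = jd->request_buf_addr;
    if (n < max) out[n++] = jd->response_buf_addr;
    for (int i = 0; i < 2 && n < max; i++)
        out[n++] = rb->payload_addrs[i];
    return n;
}

int main(void)
{
    struct request_buffer rb = { .command = 1,
        .payload_addrs = { 0x7f0000003000ull, 0x7f0000004000ull } };
    struct job_descriptor jd = { .pasid = 42,
        .request_buf_addr = 0x7f0000001000ull,
        .response_buf_addr = 0x7f0000002000ull };

    uint64_t vas[8];
    int n = collect_vas(&jd, &rb, vas, 8);
    for (int i = 0; i < n; i++)
        printf("pre-translate VA 0x%llx\n", (unsigned long long)vas[i]);
    return 0;
}
```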
There are several benefits to using enqueue command instructions and job descriptors to submit jobs to hardware subsystems. Besides simplifying the job submission process by hiding hardware semantics from software applications as mentioned above, another benefit of using the enqueue command instruction is the automatic translation of process address space identifiers (PASIDs).
PASIDs are used to share a single hardware subsystem across multiple software threads or processes while providing each thread or process with a corresponding address space. PASID can be extended to virtualized environments through the concept of guest PASIDs (gPASID) and host PASIDs (hPASID). Virtual machines in the virtualized environment operate using guest PASIDs while the hypervisor and/or the underlying hardware operate using host PASIDs. Each task submitted by a software thread in the VM is associated with a guest PASID which must be translated into a corresponding host PASID. This translation task is typically performed by the hypervisor.
With an enqueue command instruction, the translation of guest PASID into host PASID is handled by hardware via virtual machine extensions and PASID translation tables.
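Conceptually, this hardware lookup can be pictured as a simple table indexed by guest PASID, as in the sketch below; the flat table layout and the translate_pasid() helper are illustrative assumptions and not the actual PASID translation table format.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy guest-to-host PASID translation table; a flat array is an
 * illustrative assumption, not the actual hardware table organization. */
#define MAX_GUEST_PASID 256
static int32_t pasid_table[MAX_GUEST_PASID];   /* -1 = no mapping */

static void pasid_table_init(void)
{
    for (int i = 0; i < MAX_GUEST_PASID; i++)
        pasid_table[i] = -1;
}

/* With the enqueue instruction, a lookup like this is performed in hardware
 * as the descriptor is submitted, so the VM does not have to exit to the
 * hypervisor to translate the guest PASID. */
static int translate_pasid(uint32_t gpasid, uint32_t *hpasid)
{
    if (gpasid >= MAX_GUEST_PASID || pasid_table[gpasid] < 0)
        return -1;                             /* fault / hypervisor assist */
    *hpasid = (uint32_t)pasid_table[gpasid];
    return 0;
}

int main(void)
{
    pasid_table_init();
    pasid_table[7] = 42;                       /* gPASID 7 -> hPASID 42 */

    uint32_t hpasid;
    if (translate_pasid(7, &hpasid) == 0)
        printf("gPASID 7 -> hPASID %u\n", (unsigned)hpasid);
    return 0;
}
```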
Using an enqueue command instruction means that fewer PASID translations need to be performed by the hypervisor. However, enqueue command instructions are currently used mainly for submitting device-specific workloads/commands and do not support the submission of control commands, which are commands for controlling common operations shared between hardware subsystems.
In a VM environment, control commands are frequently used during VM transitions. For example, the PASID reset control command is typically triggered each time a guest application shuts down. The purpose of the PASID reset command is to inform the hardware subsystem to go through the pending queue and remove all inflight requests associated with an application-assigned host PASID in order to release resources. The PASID drain command is another control command, often used during live migration, to instruct the hardware subsystem to gracefully process all inflight requests of a specific application-assigned host PASID. Each time a control command is issued by a software application or thread, the guest PASID in the command must be translated by the hypervisor into a corresponding host PASID, incurring high overhead in the process.
Embodiments of the present invention extend the enqueue command instruction to include the ability to submit control commands. This helps eliminate hypervisor context switches for common command submissions in a VM, thereby reducing the burden on the hypervisor software and increasing performance. In one embodiment, the format of the job descriptor includes a command type field to indicate whether the command in the job descriptor is a control command. When the job descriptor is enqueued into the job queue of the hardware subsystem, the command type field is copied over to the job queue. In another embodiment, the command type field is added to the entries in the job queue. When a special form of the enqueue command instruction is executed by the CPU, the job descriptor is stored or copied to the entry in the job queue, and the command type field of the entry is automatically updated to indicate that the job descriptor contains a control command. In some embodiments, the job queue comprises one or more registers.
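As a sketch of the idea, the following C fragment shows a job queue entry carrying a command type field and a dispatcher that treats control commands (such as PASID reset or PASID drain) differently from workloads; the field names and encodings are assumptions for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical command types; the field name and encodings are
 * illustrative only. */
enum cmd_type { CMD_WORKLOAD = 0, CMD_CONTROL = 1 };
enum ctrl_op  { CTRL_PASID_RESET = 1, CTRL_PASID_DRAIN = 2 };

struct job_queue_entry {
    uint8_t  cmd_type;   /* copied/set when the descriptor is enqueued */
    uint32_t pasid;      /* host PASID after hardware translation      */
    uint32_t op;         /* workload opcode or control operation       */
};

/* Sketch of how the hardware subsystem might dispatch queue entries:
 * control commands act on the pending queue itself rather than describing
 * a device-specific workload. */
static void dispatch(const struct job_queue_entry *e)
{
    if (e->cmd_type == CMD_CONTROL) {
        if (e->op == CTRL_PASID_RESET)
            printf("drop all inflight requests for PASID %u\n", (unsigned)e->pasid);
        else if (e->op == CTRL_PASID_DRAIN)
            printf("finish all inflight requests for PASID %u\n", (unsigned)e->pasid);
    } else {
        printf("run workload opcode %u for PASID %u\n",
               (unsigned)e->op, (unsigned)e->pasid);
    }
}

int main(void)
{
    struct job_queue_entry work  = { CMD_WORKLOAD, 42, 0x1 };
    struct job_queue_entry reset = { CMD_CONTROL, 42, CTRL_PASID_RESET };
    dispatch(&work);
    dispatch(&reset);
    return 0;
}
```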
The use of enqueue command instructions to submit common control commands increases CPU performance because the virtual machines do not need to trap into the hypervisor to translate guest PASIDs. This also provides a generic and simple way that all device manufacturers can adopt for common control commands so that they no longer need to define their own interface/format for such commands. Moreover, since embodiments of the present invention are an extension to the enqueue command instruction, they are compatible with the existing instruction set architecture (ISA). The common format is also generic across various types of hardware or PCI devices, including network adapters, graphics accelerators, data accelerators, etc.
The following are example implementations of different embodiments of the invention.
Example 1. An apparatus that includes a processor to execute an enqueue instruction to submit, to a hardware subsystem, a job descriptor describing a job to be performed. The job descriptor references a memory location in which data required to perform the job is stored. The memory location is referenced by a first memory address in a first address space. The apparatus further includes an input-output memory management unit (IOMMU) to obtain an address translation for the memory location responsive to a pre-translation request from the processor. The address translation is obtained by the IOMMU prior to receiving a memory access request for the data from the hardware subsystem to perform the job. The address translation includes a mapping of the first address in the first address space to a second address in a second address space. Responsive to the memory access request, the IOMMU is to retrieve the data from the memory location based on the address translation and to provide the data to the hardware subsystem to fulfill the request.
Example 2. The apparatus of Example 1, wherein the hardware subsystem is to perform the job using the data received from the IOMMU.
Example 3. The apparatus of Example 1, wherein the request is a direct memory access (DMA) request to access the memory.
Example 4. The apparatus of Example 1, further including a local cache of the IOMMU to store the address translation.
Example 5. The apparatus of Example 1, wherein the enqueue instruction is to specify a memory address of the job descriptor and an identifier of the hardware subsystem.
Example 6. The apparatus of Example 1, wherein the job descriptor includes a pre-translation indicator to indicate whether the processor is to send the pre-translation request to the IOMMU.
Example 7. The apparatus of Example 6, wherein the processor is to determine the first address from the job descriptor and provide the first address to the IOMMU when the pre-translation indicator is set to a first value.
Example 8. The apparatus of Example 7, wherein the processor is further to provide information to the IOMMU to identify one or more page tables from which the IOMMU is to obtain the address translation.
Example 9. The apparatus of Example 7, wherein the processor is not to determine the first address from the job descriptor and/or not to provide the first address to the IOMMU when the pre-translation indicator is set to a second value.
Example 10. The apparatus of Example 6, wherein the first address space is a virtual address space and the second address space is a physical address space, and wherein the address translation comprises a virtual-to-physical address translation for the first address.
Example 11. The apparatus of Example 1, wherein the memory location is referenced directly by the job descriptor.
Example 12. The apparatus of Example 1, wherein the memory location is referenced indirectly by the job descriptor.
Example 13. The apparatus of Example 1, wherein the processor is to store the job descriptor into a job queue of the hardware subsystem responsive to an execution of the enqueue instruction.
Example 14. A method that includes: executing, by a processor, an enqueue instruction to submit, to a hardware subsystem, a job descriptor describing a job to be performed, the job descriptor referencing a memory location in which data required to perform the job is stored, the memory location referenced by a first memory address in a first address space; obtaining, by an input-output memory management unit (IOMMU), an address translation for the memory location responsive to a pre-translation request from the processor and prior to the IOMMU receiving a memory access request for the data from the hardware subsystem to perform the job, the address translation comprising a mapping of the first address in the first address space to a second address in a second address space; and responsive to the memory access request, retrieving the data from the memory location based on the address translation and providing the data to the hardware subsystem to fulfill the request.
Example 15. The method of Example 14, further including performing the job using the data received from the IOMMU.
Example 16. The method of Example 14, wherein the request is a direct memory access (DMA) request to access the memory.
Example 17. The method of Example 14, further including storing the address translation in a local cache of the IOMMU.
Example 18. The method of Example 14, further including specifying a memory address of the job descriptor and an identifier of the hardware subsystem in the enqueue instruction.
Example 19. The method of Example 14, further including setting a pre-translation indicator of the job descriptor to indicate whether the processor is to send the pre-translation request to the IOMMU.
Example 20. The method of Example 19, further including determining the first address from the job descriptor and providing the first address to the IOMMU when the pre-translation indicator is set to a first value.
Example 21. The method of Example 20, further including providing information to the IOMMU for identifying one or more page tables from which to obtain the address translation.
Example 22. The method of Example 20, further including not determining the first address from the job descriptor and/or not providing the first address to the IOMMU when the pre-translation indicator is set to a second value.
Example 23. The method of Example 19, wherein the first address space is a virtual address space and the second address space is a physical address space, and wherein the address translation comprises a virtual-to-physical address translation of the first address.
Example 24. The method of Example 14, wherein the memory location is referenced directly by the job descriptor.
Example 25. The method of Example 14, wherein the memory location is referenced indirectly by the job descriptor.
Example 26. The method of Example 14, further including storing the job descriptor into a job queue of the hardware subsystem responsive to an execution of the enqueue instruction.
The front end hardware 1830 includes a branch prediction hardware 1832 coupled to an instruction cache hardware 1834, which is coupled to an instruction translation lookaside buffer (TLB) 1836, which is coupled to an instruction fetch hardware 1838, which is coupled to a decode hardware 1840. The decode hardware 1840 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode hardware 1840 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 1890 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode hardware 1840 or otherwise within the front end hardware 1830). The decode hardware 1840 is coupled to a rename/allocator hardware 1852 in the execution engine hardware 1850.
The execution engine hardware 1850 includes the rename/allocator hardware 1852 coupled to a retirement hardware 1854 and a set of one or more scheduler hardware 1856. The scheduler hardware 1856 represents any number of different schedulers, including reservation stations, central instruction window, etc. The scheduler hardware 1856 is coupled to the physical register file(s) hardware 1858. Each of the physical register file(s) hardware 1858 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) hardware 1858 comprises a vector registers hardware, a write mask registers hardware, and a scalar registers hardware. This register hardware may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) hardware 1858 is overlapped by the retirement hardware 1854 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using register maps and a pool of registers; etc.). The retirement hardware 1854 and the physical register file(s) hardware 1858 are coupled to the execution cluster(s) 1860. The execution cluster(s) 1860 includes a set of one or more execution hardware 1862 and a set of one or more memory access hardware 1864. The execution hardware 1862 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution hardware dedicated to specific functions or sets of functions, other embodiments may include only one execution hardware or multiple execution hardware that all perform all functions. The scheduler hardware 1856, physical register file(s) hardware 1858, and execution cluster(s) 1860 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler hardware, physical register file(s) hardware, and/or execution cluster—and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access hardware 1864). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set of memory access hardware 1864 is coupled to the memory hardware 1870, which includes a data TLB hardware 1872 coupled to a data cache hardware 1874 coupled to a level 2 (L2) cache hardware 1876. In one exemplary embodiment, the memory access hardware 1864 may include a load hardware, a store address hardware, and a store data hardware, each of which is coupled to the data TLB hardware 1872 in the memory hardware 1870. The instruction cache hardware 1834 is further coupled to a level 2 (L2) cache hardware 1876 in the memory hardware 1870. The L2 cache hardware 1876 is coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 1800 as follows: 1) the instruction fetch 1838 performs the fetch and length decoding stages 1802 and 1804; 2) the decode hardware 1840 performs the decode stage 1806; 3) the rename/allocator hardware 1852 performs the allocation stage 1808 and renaming stage 1810; 4) the scheduler hardware 1856 performs the schedule stage 1812; 5) the physical register file(s) hardware 1858 and the memory hardware 1870 perform the register read/memory read stage 1814; the execution cluster 1860 performs the execute stage 1816; 6) the memory hardware 1870 and the physical register file(s) hardware 1858 perform the write back/memory write stage 1818; 7) various hardware may be involved in the exception handling stage 1822; and 8) the retirement hardware 1854 and the physical register file(s) hardware 1858 perform the commit stage 1824.
The core 1890 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, CA; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, CA), including the instruction(s) described herein. In one embodiment, the core 1890 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2, and/or some form of the generic vector friendly instruction format (U=0 and/or U=1), described below), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).
While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache hardware 1834/1874 and a shared L2 cache hardware 1876, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.
Thus, different implementations of the processor 1900 may include: 1) a CPU with the special purpose logic 1908 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1902A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 1902A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput); and 3) a coprocessor with the cores 1902A-N being a large number of general purpose in-order cores. Thus, the processor 1900 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1900 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache hardware 1906, and external memory (not shown) coupled to the set of integrated memory controller hardware 1914. The set of shared cache hardware 1906 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect hardware 1912 interconnects the integrated graphics logic 1908, the set of shared cache hardware 1906, and the system agent hardware 1910/integrated memory controller hardware 1914, alternative embodiments may use any number of well-known techniques for interconnecting such hardware. In one embodiment, coherency is maintained between one or more cache hardware 1906 and cores 1902A-N.
In some embodiments, one or more of the cores 1902A-N are capable of multi-threading. The system agent 1910 includes those components coordinating and operating cores 1902A-N. The system agent hardware 1910 may include for example a power control unit (PCU) and a display hardware. The PCU may be or include logic and components needed for regulating the power state of the cores 1902A-N and the integrated graphics logic 1908. The display hardware is for driving one or more externally connected displays.
The cores 1902A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1902A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set. In one embodiment, the cores 1902A-N are heterogeneous and include both the “small” cores and “big” cores described below.
The optional nature of additional processors 2015 is denoted with broken lines.
The memory 2040 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 2020 communicates with the processor(s) 2010, 2015 via a multi-drop bus, such as a frontside bus (FSB), point-to-point interface, or similar connection 2095.
In one embodiment, the coprocessor 2045 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 2020 may include an integrated graphics accelerator.
There can be a variety of differences between the physical resources 2010, 2015 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.
In one embodiment, the processor 2010 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 2010 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 2045. Accordingly, the processor 2010 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 2045. Coprocessor(s) 2045 accept and execute the received coprocessor instructions.
Processors 2170 and 2180 are shown including integrated memory controller (IMC) hardware 2172 and 2182, respectively. Processor 2170 also includes as part of its bus controller hardware point-to-point (P-P) interfaces 2176 and 2178; similarly, second processor 2180 includes P-P interfaces 2186 and 2188. Processors 2170, 2180 may exchange information via a point-to-point (P-P) interface 2150 using P-P interface circuits 2178, 2188.
Processors 2170, 2180 may each exchange information with a chipset 2190 via individual P-P interfaces 2152, 2154 using point to point interface circuits 2176, 2194, 2186, 2198. Chipset 2190 may optionally exchange information with the coprocessor 2138 via a high-performance interface 2139. In one embodiment, the coprocessor 2138 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.
A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Chipset 2190 may be coupled to a first bus 2116 via an interface 2196. In one embodiment, first bus 2116 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as code 2130 illustrated in
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
Although some embodiments have been described in reference to particular implementations, other implementations are possible according to some embodiments. Additionally, the arrangement and/or order of elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some embodiments.
In each system shown in a figure, the elements in some cases may each have a same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.
In the description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
An embodiment is an implementation or example of the inventions. Reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the inventions. The various appearances “an embodiment,” “one embodiment,” or “some embodiments” are not necessarily all referring to the same embodiments.
Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular embodiment or embodiments. If the specification states a component, feature, structure, or characteristic “may”, “might”, “can” or “could” be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CN2020/138775 | 12/24/2020 | WO |