Embodiments of the invention relate to heterogeneous computing.
According to Dennard scaling, voltage and current should be proportional to the linear dimensions of a transistor, and power consumption (the product of voltage and current) should be proportional to the area of a transistor. As transistors continue to shrink, the number of transistors that can fit into the same chip area has grown exponentially. Thus, it was predicted that the computing performance per watt could also grow exponentially. However, Dennard scaling appears to have broken down in the last decade. Even though the size of transistors continues to shrink, the per-watt computing performance has not improved at the same rate. There are various reasons for the breakdown of Dennard scaling. One of the reasons is that at small sizes, current leakage can cause a chip to heat up, which increases energy costs and the risk of thermal runaway. To prevent thermal runaway, a portion of the silicon on the chip cannot be powered on at the nominal operating voltage for a given thermal design power (TDP) constraint. This phenomenon, referred to as “dark silicon,” significantly constrains the per-watt computing performance in modern processors.
The breakdown of Dennard scaling has prompted chip manufacturers to resort to multicore processor designs. However, even multicore processors have encountered the same “dark silicon” problem. Depending on the processor architecture, cooling technology, and application workloads, the amount of dark silicon may exceed 50%. Thus, there is a need to improve energy and computing efficiency in modern computer systems.
In one embodiment, a heterogeneous computing system is provided. The heterogeneous computing system includes a plurality of processors of different processor types, wherein each processor includes an internal memory unit to store its current context. The heterogeneous computing system also includes a parallel processing module which further includes a plurality of execution units. The heterogeneous computing system also includes a switch module coupled to the processors and the parallel processing module. The switch module is operative to select, according to a control signal, one of the processors to use the parallel processing module for executing an instruction with multiple data entries in parallel.
In another embodiment, a method is provided to be performed by a heterogeneous computing system. The method comprises selecting, according to a control signal, one of a plurality of processors to connect to a parallel processing module in the heterogeneous computing system. The processors have different processor types and each processor includes an internal memory unit to store its context. The parallel processing module includes a plurality of execution units. The method further comprises receiving, by the parallel processing module, an instruction with multiple data entries from the one of the processors; and executing, by the execution units, the instruction on the multiple data entries in parallel.
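Viewed as software, the method reduces to three steps: select a processor, receive its instruction, and execute the instruction on multiple data entries. The following C sketch models these steps; the type names, the NUM_EUS lane count, and the per-lane operation are illustrative assumptions, not part of the claimed method.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_EUS 8   /* assumed number of execution units */

/* One instruction carrying multiple data entries, one per execution unit. */
typedef struct {
    int32_t data[NUM_EUS];
} simd_instruction;

typedef struct processor processor;
struct processor {
    int id;
    /* Each processor supplies its next SIMD instruction from its own
       locally stored context (e.g., its own program counter). */
    simd_instruction (*fetch_simd)(processor *self);
};

/* Select one processor according to the control signal, receive its
   instruction, and execute it on all execution units in parallel. */
void run_simd_step(processor *procs, size_t nprocs,
                   size_t control_signal, int32_t out[NUM_EUS]) {
    processor *p = &procs[control_signal % nprocs];  /* the switch module */
    simd_instruction inst = p->fetch_simd(p);        /* receive instruction */
    for (size_t eu = 0; eu < NUM_EUS; eu++)          /* conceptually parallel */
        out[eu] = inst.data[eu] * 2;  /* placeholder per-lane ALU operation */
}
```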
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that different references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean at least one. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
In the following description, numerous specific details are set forth. It will be appreciated, however, by one skilled in the art that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the understanding of this description. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation.
A heterogeneous computing system includes more than one type of processor working in tandem to perform computing tasks. For example, a heterogeneous computing system may include one or more central processing units (CPUs), one or more graphics processing units (GPUs), one or more digital signal processors (DSPs), one or more application-specific instruction set processors (ASIPs), one or more application-specific integrated circuits (ASICs), etc. In some embodiments, the processors may all be integrated into a system-on-a-chip (SoC) platform.
As an example, a heterogeneous computing system may include a combination of CPUs, GPUs, DSPs, ASIPs and ASICs. The CPU performs general-purpose computing tasks. The DSP and ASIP perform signal, image and/or multimedia processing operations. Both DSP and ASIP may be programmable. An example of an ASIP is a specialized hardware accelerator that performs specialized functions supported by the system. An ASIC is a fixed-function processor that performs a pre-determined sequence of specialized operations; e.g., encoding and decoding. The GPU performs graphics processing tasks; e.g., creating 2D raster representations of 3D scenes. These graphics processing tasks are referred to as 3D graphics pipelining or rendering pipelining. The 3D graphics pipelining may be implemented by a combination of fixed-function hardware tailored for speeding up the computation, and general-purpose programmable hardware to allow flexibility in graphics rendering. The general-purpose programmable hardware is also referred to as shader hardware. In addition to rendering graphics, the shader hardware can also perform general computing tasks.
The processors in a heterogeneous computing system typically include parallel execution hardware for performing single-instruction-multiple-data (SIMD) operations. In prior art systems, such SIMD architecture is implemented separately in each processor; the SIMD hardware is therefore duplicated. The chip area occupied by the duplicated SIMD hardware is not fully utilized, because not all processors perform SIMD execution at the same time.
According to embodiments of the invention, processors of a heterogeneous computing system perform SIMD operations using a shared parallel processing module that includes multiple execution units, such as arithmetic logic units (ALUs). The sharing of the execution units reduces hardware costs and increases hardware utilization. To reduce the context switch overhead when SIMD execution switches from one processor to another, each processor maintains separate memory control. More specifically, each processor maintains its own context in its internal memory unit, such as registers and/or buffers. Each processor also has its own memory interface for accessing instructions and data in a system memory such as dynamic random access memory (DRAM) devices. The separate memory control reduces the amount of context switching and therefore increases energy and computing efficiency.
The term “context switch” in computing generally refers to the mechanism of storing and restoring the state (also referred to as the “context”) of a process or thread so that execution can be resumed from the same point at a later time. Examples of the context include, but are not limited to, the program counter, the stack pointer, register contents, etc. According to embodiments of the invention, the processors that share the execution units store their respective contexts (e.g., execution states) locally and separately, such that when SIMD execution switches from a first processor to a second processor, there is no or negligible context switch overhead for storing the context of the first processor and restoring the context of the second processor. That is, instead of using a common process and shared buffers for context switching among processors, each processor stores its own context in its internal memory unit, such as local buffers. When SIMD execution switches from the first processor to the second processor, the context of the first processor remains in the first processor and is ready for use when needed later. The context of the second processor is already in the second processor and can be used right away. The separate context management avoids the time- and energy-consuming context storage and restoration when SIMD execution switches among the processors.
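The difference can be made concrete with a small C model. The context fields mirror the examples above (program counter, stack pointer, register contents); everything else is an illustrative assumption.

```c
#include <stdint.h>

/* Per-processor execution state (the "context"), kept in that
   processor's own internal registers/buffers. */
typedef struct {
    uint32_t program_counter;
    uint32_t stack_pointer;
    uint32_t regs[16];
} context;

typedef struct {
    int     id;
    context ctx;   /* lives inside the processor; never copied out */
} processor;

/* Conventional scheme, for contrast: switching requires storing the
   outgoing context to a shared buffer and restoring the incoming one. */
void conventional_switch(context *shared_buffers,
                         processor *from, processor *to) {
    shared_buffers[from->id] = from->ctx;  /* store: costs time and energy */
    to->ctx = shared_buffers[to->id];      /* restore: costs time and energy */
}

/* Scheme described here: the switch merely re-points the shared SIMD
   datapath at another processor; both contexts stay where they are. */
processor *local_switch(processor *procs, int next_id) {
    return &procs[next_id];   /* no store, no restore */
}
```

Because local_switch copies nothing, the handoff cost is independent of the context size, which is the source of the no-or-negligible-overhead property described above.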
Additionally, each processor has its own memory interface for accessing the system memory for instructions, data and other information. The term “memory interface” refers to a hardware unit in the processor that has access to the system memory. Examples of memory interfaces include, but are not limited to, a direct memory access (DMA) unit, a load-and-store unit, etc. Having separate memory interfaces enables the processors to keep their specific data flow control.
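As a software analogy, each processor's separate memory interface can be modeled as its own handle onto the shared system memory. The structure below is a sketch under that assumption; a real interface would be a DMA unit or a load-and-store unit, as noted above.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Per-processor memory interface: each processor owns one, so its data
   flow to and from the system memory is controlled independently of
   the other processors. */
typedef struct {
    uint8_t *system_memory;   /* shared DRAM */
    size_t   base;            /* this processor's current window */
    size_t   limit;
} memory_interface;

/* DMA-style block read from system memory into a processor-local buffer. */
int mem_read(const memory_interface *mi, size_t offset,
             void *local_buf, size_t len) {
    if (offset + len > mi->limit)
        return -1;            /* outside this processor's window */
    memcpy(local_buf, mi->system_memory + mi->base + offset, len);
    return 0;
}
```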
The processors 112 are connected to the system memory 160 via an interconnect 150. The processors 112 are also connected to a switch module 120, which is further connected to a unified decoder 130 and a parallel processing module 140. The switch module 120 can be controlled to connect any one of the processors 112 to the unified decoder 130 and the parallel processing module 140. The parallel processing module 140 includes a plurality of execution units (EUs) 142; e.g., ALUs. Each of the execution units 142 executes arithmetic or logic operations, and the parallel processing module 140, as a whole, executes SIMD operations. That is, the parallel processing module 140 can execute a single instruction on multiple data entries in parallel. The instructions executed by the execution units 142 have a unified instruction format according to an instruction set architecture (ISA) defined for the parallel processing module 140. The data processed by the execution units 142 has a unified data format, chosen from a set of unified data formats; e.g., full-precision, short integer, floating point, long integer, etc. In one embodiment, the parallel processing module 140 may include a vector execution unit that performs a vector operation on an array of data.
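As a sketch of how the parallel processing module might consume a unified instruction, consider the following C model. The opcode set, the lane count, and the exact format fields are assumptions made for illustration, echoing the unified data formats named above.

```c
#include <stdint.h>

#define NUM_EUS 8

/* Assumed unified data formats for execution-unit operands. */
typedef enum { FMT_FULL_PRECISION, FMT_SHORT_INT,
               FMT_FLOAT, FMT_LONG_INT } data_format;

typedef enum { OP_ADD, OP_MUL } opcode;

/* Assumed unified instruction format: one operation, one data format,
   and an array of operand pairs (one pair per execution unit). */
typedef struct {
    opcode      op;
    data_format fmt;
    int64_t     src0[NUM_EUS];
    int64_t     src1[NUM_EUS];
} unified_inst;

/* One execution unit: a scalar arithmetic/logic operation. */
static int64_t alu(opcode op, int64_t a, int64_t b) {
    return op == OP_ADD ? a + b : a * b;
}

/* The module as a whole: a single instruction applied to multiple data
   entries, conceptually by all execution units in the same cycle. */
void execute_simd(const unified_inst *inst, int64_t dst[NUM_EUS]) {
    for (int eu = 0; eu < NUM_EUS; eu++)
        dst[eu] = alu(inst->op, inst->src0[eu], inst->src1[eu]);
}
```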
In one embodiment, the switch module 120 is controlled by a context switch controller 170, which may be a hardware unit, or a software process located in or executed by one or more CPUs or other control hardware. The context switch controller 170 determines which processor 112 the SIMD execution should switch to, and generates a control signal that selects that processor 112 to connect to the parallel processing module 140. An example of the context switch controller 170 is described below.
In one embodiment, the heterogeneous computing system 100 may be part of a mobile computing and/or communication device (e.g., a smartphone, a tablet, a laptop, a gaming device, etc.). In one embodiment, the heterogeneous computing system 100 may be part of a desktop computing system, a server computing system, or a cloud computing system.
The GPU shader 210 is a programmable processor specialized for graphics operations. In one embodiment, the GPU shader 210 includes a command queue 211, a control unit 212, program register files 214, shared buffers 215, special functions 216, the memory interface 118 and other units. Examples of the control unit 212 include, but are not limited to, branch predictors, command fetch units, etc. The DSP 220 is a programmable processor, which includes a sequencer 221, a direct-memory-access (DMA) unit 222, local buffers 223, the memory interface 118 and other units. The ASIP 230 is also a programmable processor, which includes a specialized memory interface 231, specialized buffers 232, special functions 233, a sequencer 234, the memory interface 118 and other units. Additionally, one or more of the GPU shader 210, DSP 220 and ASIP 230 may include a cache for storing recently accessed and/or pre-fetched data retrieved from the system memory 160, and a buffer or other type of temporary memory for storing the intermediate data generated by the parallel processing module 140, among other information. The DSP 220 and the ASIP 230 are programmable processors for performing specialized functions. Examples of the special functions 216 and 233 include, but are not limited to: special mathematical functional units such as sine, cosine and log functions, graphics processing, voice data processing, video processing, and image processing.
In one embodiment, each processor has a built-in mechanism (e.g., the command queue 211, the sequencer 221 and the sequencer 234) for determining which instruction to execute next, as well as internal registers or buffers (i.e., on-processor registers or on-processor buffers) for storing the current context, such as the program counter, stack pointer, register contents, etc. When SIMD execution switches from a first processor to a second processor, the stored context of the second processor can be quickly (e.g., in one cycle) retrieved from its internal registers or buffers to start the execution process. The context of the first processor is stored in its internal registers or buffers for fast retrieval when the SIMD execution switches back to the first processor.
Although each processor has internal registers or buffers to store its context, in some scenarios the amount of context may exceed the capacity of these internal registers or buffers. For example, when a single processor executes multiple tasks and one or more of the tasks have real-time constraints, the processor may switch contexts among the multiple tasks. To store the contexts of these multiple tasks, the processor may use an external buffer (i.e., an off-processor or off-chip buffer) when the amount of context exceeds its internal context storage capacity.
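That overflow policy amounts to a one-branch decision, as the following sketch illustrates. The slot count is an assumed on-processor capacity, and the external buffer is assumed to be pre-allocated off-chip.

```c
#define INTERNAL_SLOTS 4   /* assumed on-processor context capacity */

typedef struct { unsigned pc, sp, regs[16]; } task_context;

typedef struct {
    task_context  internal[INTERNAL_SLOTS];  /* on-processor buffers */
    int           used;
    task_context *external;                  /* off-processor spill area */
    int           spilled;
} context_store;

/* Save a task's context internally while a slot is free; once the
   internal capacity is exceeded, spill to the external buffer. */
void save_task_context(context_store *cs, const task_context *ctx) {
    if (cs->used < INTERNAL_SLOTS)
        cs->internal[cs->used++] = *ctx;     /* fast internal path */
    else
        cs->external[cs->spilled++] = *ctx;  /* overflow to off-chip */
}
```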
In some embodiments, the frontend 331 may be part of one or more of the processors 112; that is, part of the processors' native decode-and-fetch circuitry. For example, processor P1 may include the instruction decode 320a and the data fetch 310a, as shown in the dashed lines, as part of its native decode-and-fetch circuitry. An instruction is executed by P1 if it is decoded to be a non-SIMD instruction; the instruction is sent to the parallel processing module 140 for execution if it is decoded to be a SIMD instruction. In some embodiments, one or more of the processors 112, such as fixed-function processors, execute a pre-determined sequence of operations and therefore may not need to decode instructions. These fixed-function processors do not have native decode circuitry for decoding instructions. In this case (e.g., P4), the unified decoder 130 provides the instruction decode 320d, which generates an indicator when a SIMD operation is to be performed. The indicator may specify the SIMD operation to be performed and the data format of the SIMD operation. The indicator and the source operands fetched by the data fetch 310d are then sent to the backend 332 via the switch module 120 when P4 is selected for connection to the parallel processing module 140.
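The routing decision described for P1 can be expressed as a small dispatcher. In the sketch below, the bit-field encoding is an assumption invented for illustration and does not reflect the unified ISA's actual instruction layout; the two handler functions are stubs standing in for the processor's own pipeline and the handoff through the switch module.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t raw; } instruction;

/* Indicator produced when a SIMD operation is to be performed: it names
   the operation and the data format of its operands. */
typedef struct {
    int op;
    int data_format;
} simd_indicator;

/* Assumed encoding, purely for illustration: the top bit marks SIMD. */
static bool is_simd(instruction i) { return (i.raw >> 31) != 0; }

static void execute_natively(instruction i) { (void)i; /* P1's own pipeline */ }
static void send_to_parallel_module(simd_indicator ind) { (void)ind; /* via the switch module */ }

/* Frontend routing for a processor with native decode circuitry. */
void dispatch(instruction i) {
    if (is_simd(i)) {
        simd_indicator ind = {
            .op          = (int)((i.raw >> 24) & 0x7f),
            .data_format = (int)((i.raw >> 20) & 0xf),
        };
        send_to_parallel_module(ind);  /* SIMD: off to the shared module */
    } else {
        execute_natively(i);           /* non-SIMD: stays in the processor */
    }
}
```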
The process 400 repeats from step 410 each time a processor is selected for SIMD execution. For example, when the control signal selects another processor (“next processor”) for SIMD execution, the next processor can use its locally stored context to retrieve an instruction for execution, without reloading and restoring that context into its local memory. In addition, the context of the previous processor (i.e., the target processor) can stay locally within the target processor. The target processor may continue to perform non-SIMD operations using its locally stored context, or may wait for its turn to use the parallel processing module 140 again for SIMD execution.
The context switch controller 170 may use different hardware modules to implement different scheduling policies for requests that have different priorities; for example, requests with higher priority may be granted access to the parallel processing module 140 ahead of requests with lower priority.
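One plausible policy is sketched below, assuming four requesters and a higher-is-more-urgent priority convention: the highest-priority requester wins, and ties are broken round-robin so that equal-priority processors share the module fairly. The returned index models the control signal that drives the switch module 120.

```c
#include <stdbool.h>

#define NUM_PROCS 4

typedef struct {
    bool requesting;  /* processor wants the parallel processing module */
    int  priority;    /* higher value = more urgent (assumed convention) */
} request;

/* Grant the module to the highest-priority requester, rotating the
   starting point so equal-priority requesters are served round-robin. */
int select_next(const request req[NUM_PROCS], int last_granted) {
    int best = -1, best_prio = -1;
    for (int k = 1; k <= NUM_PROCS; k++) {
        int p = (last_granted + k) % NUM_PROCS;  /* rotate start point */
        if (req[p].requesting && req[p].priority > best_prio) {
            best = p;
            best_prio = req[p].priority;
        }
    }
    return best;  /* -1 when idle; otherwise the control signal */
}
```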
The method 600 may repeat the steps 610-630 whenever the control signal selects a different processor for SIMD execution. The context switch among the processors incurs little or no overhead. In one embodiment, the parallel processing module is operative to complete execution for a first processor in a first clock cycle and to receive data from a second processor in a second clock cycle immediately after the first clock cycle.
A heterogeneous computing system with a shared computing unit and separate memory controls has been described. The sharing of the computing unit (e.g., the parallel processing module 140) reduces hardware cost and increases hardware utilization. The separate memory control for each processor enables the processors to maintain their own contexts and data flow controls, and therefore reduces the context switch overhead. The overall energy and computing efficiency of the system can thereby be improved.
The operations of the flow diagrams, including the process 400 and the method 600, have been described with reference to the exemplary embodiments above. However, it should be understood that these operations can be performed by embodiments other than those discussed, and that the embodiments discussed can perform operations different from those described with reference to the flow diagrams.
While the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described, and can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting.