Computer systems often include accelerators built for computationally intensive workloads, e.g., media encoding/decoding, signal processing, sorting, pattern matching, compression, or cryptography. These accelerators often include a large number of processing elements arranged as a grid, each element being a small processor that executes a standard, sequential program stream. The operations required to process such a sequential program may be viewed as falling into two distinct classes: control processing operations and data processing operations. In a standard processor, both the control and data processing streams are handled as instructions dispatched to and executed in the execution logic of the processor.
However, this can lead to several inefficiencies. For example, in a conventional processor a large number of instructions are devoted solely to computing what the next set of instructions should be (i.e., which instructions are “ready”), from where data should be retrieved, and to where data should be stored. If instead a programmer describes a pool of operations that execute based on the arrival of certain patterns of inputs, then the computation of which instructions are “ready” can be separated out into a parallel circuit, which may improve performance dramatically by avoiding instruction-level polling of data sources.
Embodiments of the present invention avoid the standard sequential programming model for a processor by providing separate hardware components for control processing and data processing. The instruction execution order in a processing engine according to the present invention can be efficiently determined by receiving input in a control processing engine and, for each instruction of a data processing engine, setting a status of the instruction to “ready” based on a trigger for the instruction and the input received in the control processing engine. Execution of the instruction in the data processing engine may be enabled if the status of the instruction is set to “ready” and at least one processing element of the data processing engine is available to execute the instruction. In one example embodiment, the instructions may then be decoded into micro instructions or nano instructions before they are executed in the data processing engine. The trigger for each instruction may be implemented by a programmer as a function of at least one predicate register of the control processing engine, FIFO status signals from one or more FIFOs (e.g. FIFO[0], FIFO[1] etc. used for inbound/outbound data) and tags (metadata) that either arrive over FIFOs, or are already present in registers inside the processing engine.
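The readiness determination described above can be illustrated with a short sketch. This is a behavioral model only, not the disclosed hardware; the names Instruction, trigger, ready, and resolve_triggers are illustrative assumptions, not taken from the source.

```python
# Behavioral sketch (illustrative names): the control processing engine
# evaluates each instruction's trigger and sets its status to "ready"
# when the trigger holds, independently of program order.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Instruction:
    name: str
    trigger: Callable[[], bool]  # Boolean trigger function for this instruction
    ready: bool = False

def resolve_triggers(instructions: List[Instruction]) -> List[Instruction]:
    """Control-processing step: mark each instruction whose trigger
    currently evaluates true as 'ready', and return the ready set."""
    for insn in instructions:
        insn.ready = insn.trigger()
    return [i for i in instructions if i.ready]
```

In the hardware described, all triggers would be evaluated by parallel circuitry rather than by this sequential loop; the loop only models the logical outcome.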
This may provide several advantages for a processor, especially in the context of an accelerator. For example: control decisions that may have taken multiple instruction cycles on a standard program counter (PC)-based architecture may now be computed in a single cycle; control processing for multiple instructions may be computed in parallel if multiple instructions are ready to be executed and processing elements are available; and multiple algorithms may be mapped to a single processing element and executed by the processing element in an interleaved manner.
A trigger 130 is a function that may be implemented by a programmer, e.g., a Boolean function. The function specification for each trigger 130 is stored alongside each instruction 120 in the CPE's instruction storage. The function may be a Boolean expression of predicate registers 110, FIFO status signals 180, and/or comparisons of tags 190 against target values or other tags. Predicate registers 110 and FIFO status signals 180 may themselves be Boolean (true/false) values and can therefore be fed directly into a Boolean function. Tags, however, may be multi-bit values. Therefore, a comparison of a tag against an equal-bit-width target value or another tag may be used to produce a true/false signal that can be fed into the Boolean expression in the trigger function. Alternatively, a single bit or a bit mask in a tag may be compared against a target value, or tested (true/false) for being less than or greater than some value. For example, trigger[3]=pred[0] && !pred[1] && fifo[0].notEmpty && (fifo[0].tag==1010) describes the conditions under which Instruction[3] in storage 120 is allowed to execute. Where a trigger 130 is a function of FIFO status signals 180 or comparisons of tags 190, the trigger resolution module 150 may compute the output of each trigger 130 based on the input from predicate registers 110 of CPE 101 and the FIFO status signals 180 or comparisons of tags 190 in order to determine whether the status of each instruction 120 should be set to “ready”.
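The example trigger above can be written out directly as a Boolean function. The sketch below is illustrative only: the function name trigger_3, the argument names, and the concrete input values are assumptions, while the expression itself mirrors trigger[3] from the text.

```python
# trigger[3] = pred[0] && !pred[1] && fifo[0].notEmpty && (fifo[0].tag == 1010)
# written as a Python predicate (illustrative names and calling convention).
def trigger_3(pred, fifo0_not_empty, fifo0_tag):
    # The tag comparison tests the multi-bit tag against the
    # equal-bit-width target value 1010 (binary).
    return pred[0] and not pred[1] and fifo0_not_empty and fifo0_tag == 0b1010
```

In hardware this whole expression would resolve combinationally in a single cycle; the function form simply makes the true/false inputs explicit.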
FIFOs are commonly used in electronic circuits for buffering and flow control. In hardware form, a FIFO primarily consists of a set of read and write pointers, storage, and control logic. The storage may be SRAM, flip-flops, latches, or any other suitable form of storage. Examples of FIFO status flags include full, empty, almost full, and almost empty. Tags are commonly used to attach metadata to data, for example metadata associated with an algorithm indicating the source of the data. If two sources write to the same FIFO, a tag can be used to determine which source wrote a particular value. As mentioned above, tags may be multi-bit values, e.g., 1010.
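A minimal behavioral model of such a tagged FIFO, with the status flags named above, might look as follows. The class name, the (tag, value) entry format, and the capacity/threshold parameters are illustrative assumptions, not details from the source.

```python
# Behavioral sketch of a FIFO with status flags and tagged entries.
# Capacity, the almost-full/empty threshold, and the (tag, value)
# entry format are illustrative assumptions.
from collections import deque

class TaggedFifo:
    def __init__(self, capacity=4, almost=1):
        self.capacity = capacity
        self.almost = almost       # threshold for almost full / almost empty
        self.q = deque()

    def write(self, tag, value):
        if self.full:
            raise OverflowError("write to full FIFO")
        self.q.append((tag, value))

    def read(self):
        if self.empty:
            raise IndexError("read from empty FIFO")
        return self.q.popleft()

    # Status flags of the kind listed in the text.
    @property
    def empty(self): return len(self.q) == 0
    @property
    def full(self): return len(self.q) == self.capacity
    @property
    def almost_empty(self): return len(self.q) <= self.almost
    @property
    def almost_full(self): return len(self.q) >= self.capacity - self.almost

    @property
    def head_tag(self):
        """Tag of the value at the head, e.g. for trigger tag comparisons."""
        return self.q[0][0] if self.q else None
```

The head_tag property models how a trigger could compare the tag of the next inbound value without consuming it.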
An additional embodiment may provide architectural (hardware) support to guarantee that empty FIFOs are never read and full FIFOs are never written. In this case, the FIFO status signals 180 may not be made visible to the programmer. Instead, the hardware may infer these conditions by examining the input and output FIFOs that an instruction may attempt to read from or write to when it is executed. The hardware may then automatically add the appropriate “not full” or “not empty” terms to the trigger function specified by the programmer. Thus, an instruction that would attempt to read an empty FIFO or write a full FIFO will never be selected for execution, because its trigger will evaluate to false, i.e., not “ready”.
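The automatic augmentation described above can be sketched as a wrapper that ANDs the implied FIFO conditions into the programmer's trigger. All names here (augment_trigger, the flag callables) are illustrative assumptions about one way to model the behavior, not the disclosed implementation.

```python
# Sketch: the hardware-inferred trigger is the programmer's trigger
# ANDed with "no source FIFO is empty" and "no destination FIFO is full".
# src_empty_flags / dst_full_flags are callables returning current status.
def augment_trigger(programmer_trigger, src_empty_flags, dst_full_flags):
    def effective():
        if any(flag() for flag in src_empty_flags):
            return False  # instruction would read an empty FIFO
        if any(flag() for flag in dst_full_flags):
            return False  # instruction would write a full FIFO
        return programmer_trigger()
    return effective
```

Because the effective trigger is false whenever a hazardous FIFO access would occur, the instruction is simply never selected, with no programmer-visible status checks.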
A priority encoder 160 may enable instructions 120 with a “ready” status to be executed by processing elements 140 of DPE 102 if at least one processing element 140 is available to execute the instruction. In one example embodiment, the enabled instruction (triggered instruction 170) may be selected for execution by a multiplexer M and then decoded into micro instructions or nano instructions D1-D4 before being executed by processing elements 140 of DPE 102.
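The selection step performed by the priority encoder can be modeled as picking one ready instruction when a processing element is free. The fixed lowest-index priority used below is an illustrative assumption; the function and argument names are likewise not from the source.

```python
# Sketch of priority-encoder selection: among instructions whose status
# is "ready", dispatch the lowest-index one, provided a processing
# element is free. Lowest-index priority is an illustrative assumption.
from typing import List, Optional

def select_triggered(ready_flags: List[bool], free_pes: int) -> Optional[int]:
    """Return the index of the instruction to dispatch, or None if
    nothing is ready or no processing element is available."""
    if free_pes == 0:
        return None
    for idx, ready in enumerate(ready_flags):
        if ready:
            return idx
    return None
```

In hardware, this corresponds to the one-hot output of the priority encoder steering the multiplexer M toward the triggered instruction 170.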
Parallel processing in trigger resolution module 150 of all the functions of triggers 130 that may trigger instructions 120 may reduce the time required to choose instructions that are ready to be executed to a single cycle of the processing engine 100, and the execution order of the triggered instructions 120 may automatically correspond to the arrival of the inbound data needed for further execution.
Embodiments are not limited to computer systems. Alternative embodiments of the present invention can be used in other devices such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, personal digital assistants (PDAs), and handheld PCs. Embedded applications can include a microcontroller, a digital signal processor (DSP), a system on a chip, network computers (NetPC), set-top boxes, network hubs, wide area network (WAN) switches, or any other system that can perform one or more instructions in accordance with at least one embodiment.
In one embodiment, the processor 402 includes a Level 1 (L1) internal cache memory 404. Depending on the architecture, the processor 402 can have a single internal cache or multiple levels of internal cache. Alternatively, in another embodiment, the cache memory can reside external to the processor 402. Other embodiments can also include a combination of both internal and external caches depending on the particular implementation and needs. Register file 406 can store different types of data in various registers including integer registers, floating point registers, status registers, and instruction pointer register.
Alternate embodiments of a processing engine 408 can also be used in microcontrollers, embedded processors, graphics devices, DSPs, and other types of logic circuits. System 400 includes a memory 420. Memory 420 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, or another memory device. Memory 420 can store instructions and/or data represented by data signals that can be executed by the processor 402.
A system logic chip 416 is coupled to the processor bus 410 and memory 420. The system logic chip 416 in the illustrated embodiment is a memory controller hub (MCH). The processor 402 can communicate with the MCH 416 via a processor bus 410. The MCH 416 provides a high bandwidth memory path 418 to memory 420 for instruction and data storage and for storage of graphics commands, data, and textures. The MCH 416 directs data signals between the processor 402, memory 420, and other components in the system 400, and bridges the data signals between processor bus 410, memory 420, and system I/O 422. In some embodiments, the system logic chip 416 can provide a graphics port for coupling to a graphics controller 412. The MCH 416 is coupled to memory 420 through a memory interface 418. The graphics card 412 is coupled to the MCH 416 through an Accelerated Graphics Port (AGP) interconnect 414.
System 400 uses a proprietary hub interface bus 422 to couple the MCH 416 to the I/O controller hub (ICH) 430. The ICH 430 provides direct connections to some I/O devices via a local I/O bus. The local I/O bus is a high-speed I/O bus for connecting peripherals to the memory 420, chipset, and processor 402. Some examples are the audio controller, firmware hub (flash BIOS) 428, wireless transceiver 426, data storage 424, legacy I/O controller containing user input and keyboard interfaces, a serial expansion port such as Universal Serial Bus (USB), and a network controller 434. The data storage device 424 can comprise a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or other mass storage device.
For another embodiment of a system, an instruction in accordance with one embodiment can be used with a system on a chip. One embodiment of a system on a chip comprises a processor and a memory. The memory for one such system is a flash memory. The flash memory can be located on the same die as the processor and other system components. Additionally, other logic blocks such as a memory controller or graphics controller can also be located on a system on a chip.
While certain exemplary embodiments have been described and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that this invention not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art upon studying this disclosure. In an area of technology such as this, where growth is fast and further advancements are not easily foreseen, the disclosed embodiments may be readily modifiable in arrangement and detail as facilitated by enabling technological advancements without departing from the principles of the present disclosure or the scope of the accompanying claims.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US11/68117 | 12/30/2011 | WO | 00 | 6/21/2013 |