This application claims benefit of priority to Chinese Application No. 201710527119.8, entitled “STREAM PROCESSOR WITH OVERLAPPING EXECUTION”, filed Jun. 30, 2017, which is incorporated herein by reference in its entirety.
Many different types of computing systems include vector processors or single-instruction, multiple-data (SIMD) processors. Tasks can execute in parallel on these types of parallel processors to increase the throughput of the computing system. It is noted that parallel processors can also be referred to herein as “stream processors”. The term “throughput” can be defined as the amount of work (e.g., number of tasks) that a processor can perform in a given period of time. Attempts to improve the throughput of stream processors are continually being undertaken. One technique for improving the throughput of stream processors is to increase the instruction issue rate. However, increasing the instruction issue rate of a stream processor typically results in increased cost and power consumption. Consequently, it can be challenging to increase the throughput of a stream processor without increasing the instruction issue rate.
The advantages of the methods and mechanisms described herein may be better understood by referring to the following description in conjunction with the accompanying drawings.
In the following description, numerous specific details are set forth to provide a thorough understanding of the methods and mechanisms presented herein. However, one having ordinary skill in the art should recognize that the various embodiments may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the approaches described herein. It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.
Systems, apparatuses, and methods for increasing processor throughput are disclosed herein. In one embodiment, processor throughput is increased by overlapping execution of multi-pass instructions with single-pass instructions on separate execution pipelines without increasing the instruction issue rate. In one embodiment, a system includes at least a parallel processing unit with a plurality of execution pipelines. The parallel processing unit includes at least two different types of execution pipelines. These different types of execution pipelines can be referred to generally as first and second types of execution pipelines. In one embodiment, the first type of execution pipeline is a transcendental pipeline for performing transcendental operations (e.g., exponentiation, logarithm, trigonometric) and the second type of execution pipeline is a vector arithmetic logic unit (ALU) pipeline for performing fused multiply-add (FMA) operations. In other embodiments, the first and/or second execution pipelines can be other types of execution pipelines which process other types of operations.
In one embodiment, when the first type of execution pipeline is a transcendental pipeline, an application executing on the system can achieve improved shader performance for 3D graphics workloads which include a high number of transcendental operations. The traditional way of fully utilizing the compute throughput of multiple execution pipelines is to implement a multi-issue architecture with a complex instruction scheduler and a high-bandwidth vector register file. In contrast, the systems and apparatuses described herein include an instruction scheduler and a vector register file which are compatible with a single-issue architecture.
In one embodiment, a multi-pass instruction (e.g., transcendental instruction) would take one cycle for the operands to be read into the first execution pipeline and to initiate execution of a first vector element, but starting from the next cycle, the execution of the second vector element could be overlapped with instructions on the second execution pipeline if there are no dependencies between the instructions. In other embodiments, the processor architecture can be implemented and applied to other multi-pass instructions (e.g., double precision floating point instructions). Utilizing the techniques described herein, the throughput of a processor with multiple execution units is increased without increasing the instruction issue rate.
In one embodiment, a first plurality of operands for multiple vector elements of a vector instruction, to be executed by the first execution pipeline, are read from the vector register file in a single clock cycle and stored in temporary storage. In one embodiment, the temporary storage is implemented by using flip-flops coupled to the outputs of the vector register file. The operands are accessed from the temporary storage and utilized to initiate execution of multiple operations on the first execution pipeline in subsequent clock cycles. Simultaneously, the second execution pipeline accesses a second plurality of operands from the vector register file to initiate execution of one or more vector operations on the second execution pipeline during the subsequent clock cycles. In one embodiment, the first execution pipeline has a separate write port to the vector destination cache to allow for co-execution with the second execution pipeline.
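To make the single-read-port constraint concrete, the following is a minimal cycle-level sketch of the operand-fetch scheme described above. The class and register names (VectorRegisterFile, v0, and so on) are illustrative assumptions, not taken from an actual implementation; the sketch only shows that, once the multi-pass instruction has latched its operands into temporary storage, the register-file port is free for overlapping single-pass instructions in the subsequent cycles.

```python
from collections import deque

class VectorRegisterFile:
    """Toy register file permitting one operand fetch per clock cycle."""
    def __init__(self, regs):
        self.regs = regs            # e.g., {"v0": [4.0, 2.0, 0.5, 8.0]}
        self._read_done = False

    def read_operands(self, names):
        # Enforce the single-issue constraint: one instruction's fetch per cycle.
        assert not self._read_done, "register-file read conflict this cycle"
        self._read_done = True
        return {n: list(self.regs[n]) for n in names}

    def tick(self):                 # advance to the next clock cycle
        self._read_done = False

vrf = VectorRegisterFile({
    "v0": [4.0, 2.0, 0.5, 8.0],    # source for a 4-element reciprocal
    "v1": [1.0, 2.0, 3.0, 4.0],    # sources for overlapping FMA-pipeline work
    "v2": [5.0, 6.0, 7.0, 8.0],
})

# Cycle 0: the multi-pass instruction reads operands for all four vector
# elements in a single cycle and latches them in temporary storage (flops).
temp = deque(vrf.read_operands(["v0"])["v0"])
vrf.tick()

# Cycles 1-4: each reciprocal pass consumes an operand from temporary
# storage, leaving the register-file port free for the second pipeline.
for cycle in range(1, 5):
    result = 1.0 / temp.popleft()               # transcendental pass, no VRF read
    fma_srcs = vrf.read_operands(["v1", "v2"])  # overlapping instruction's fetch
    print(f"cycle {cycle}: recip pass -> {result}, FMA reads {sorted(fma_srcs)}")
    vrf.tick()
```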
Referring now to FIG. 1, a block diagram of one embodiment of a computing system 100 is shown. In one embodiment, computing system 100 includes at least processor(s) 110, input/output (I/O) interfaces 120, and memory device(s) 130.
Processor(s) 110 are representative of any number and type of processing units (e.g., central processing unit (CPU), graphics processing unit (GPU), digital signal processor (DSP), field programmable gate array (FPGA), application specific integrated circuit (ASIC)). In one embodiment, processor(s) 110 includes a vector processor with a plurality of stream processors. Each stream processor can also be referred to as a processor or a processing lane. In one embodiment, each stream processor includes at least two types of execution pipelines that share a common vector register file. In one embodiment, the vector register file includes multi-bank high density random-access memories (RAMs). In various embodiments, execution of instructions can be overlapped on the multiple execution pipelines to increase throughput of the stream processors. In one embodiment, the first execution pipeline has a first write port to a vector destination cache and the second execution pipeline has a second write port to the vector destination cache to allow both execution pipelines to write to the vector destination cache in the same clock cycle.
Memory device(s) 130 are representative of any number and type of memory devices. For example, the type of memory in memory device(s) 130 can include Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), NAND Flash memory, NOR flash memory, Ferroelectric Random Access Memory (FeRAM), or others. Memory device(s) 130 are accessible by processor(s) 110. I/O interfaces 120 are representative of any number and type of I/O interfaces (e.g., peripheral component interconnect (PCI) bus, PCI-Extended (PCI-X), PCIE (PCI Express) bus, gigabit Ethernet (GBE) bus, universal serial bus (USB)). Various types of peripheral devices can be coupled to I/O interfaces 120. Such peripheral devices include (but are not limited to) displays, keyboards, mice, printers, scanners, joysticks or other types of game controllers, media recording devices, external storage devices, network interface cards, and so forth.
In various embodiments, computing system 100 can be a computer, laptop, mobile device, server, or any of various other types of computing systems or devices. It is noted that the number of components of computing system 100 can vary from embodiment to embodiment. There can be more or fewer of each component/subcomponent than the number shown in FIG. 1.
Turning now to FIG. 2, a block diagram of one embodiment of a stream processor 200 is shown. In one embodiment, stream processor 200 includes at least vector register file 210, temporary storage 215, first execution pipeline 220, second execution pipeline 230, and vector destination cache 240.
In one embodiment, in a single cycle, a plurality of source data operands (or operands) for a vector instruction are read out of vector register file 210 and stored in temporary storage 215. In one embodiment, temporary storage 215 is implemented with a plurality of flip-flops. Then, in subsequent cycles, operands are retrieved out of temporary storage 215 and provided to individual instructions which are initiated for execution on first execution pipeline 220. Since first execution pipeline 220 does not access vector register file 210 during these subsequent cycles, second execution pipeline 230 is able to access vector register file 210 to retrieve operands to execute vector instructions which overlap with the individual instructions being executed by first execution pipeline 220. First execution pipeline 220 and second execution pipeline 230 utilize separate write ports to write results to vector destination cache 240.
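As an illustrative sketch of the separate write ports, the toy model below (with assumed names; this is not the patent's implementation) enforces a per-cycle, per-port write limit, showing how both pipelines can retire results to vector destination cache 240 in the same clock cycle without arbitration.

```python
class VectorDestinationCache:
    """Toy destination cache accepting one write per port per cycle."""
    def __init__(self, num_write_ports=2):
        self.lines = {}
        self.num_write_ports = num_write_ports
        self._writes = 0

    def write(self, line, data):
        assert self._writes < self.num_write_ports, "write-port conflict"
        self._writes += 1
        self.lines[line] = data

    def tick(self):                 # advance to the next clock cycle
        self._writes = 0

cache = VectorDestinationCache(num_write_ports=2)
# Same clock cycle: each pipeline retires through its own port.
cache.write("rcp_dst", [0.25, 0.5, 2.0, 0.125])   # first (transcendental) pipeline
cache.write("fma_dst", [9.0, 14.0, 21.0, 30.0])   # second (VALU/FMA) pipeline
cache.tick()
```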
In one embodiment, first execution pipeline 220 is a transcendental execution pipeline and second execution pipeline 230 is a vector arithmetic logic unit (VALU) pipeline. The VALU pipeline can also be implemented as a vector fused multiply-add (FMA) pipeline. In other embodiments, first execution pipeline 220 and/or second execution pipeline 230 can be other types of execution pipelines. It should be understood that while two separate types of execution pipelines are shown in stream processor 200, this is meant to illustrate one possible embodiment. In other embodiments, stream processor 200 can include other numbers of different types of execution pipelines which are coupled to a single vector register file.
Referring now to FIG. 3, a block diagram of another embodiment of a stream processor 300 is shown. In one embodiment, stream processor 300 includes at least transcendental execution pipeline 305, fused multiply-add (FMA) execution pipeline 310, and instruction operand buffer 315.
In one embodiment, stream processor 300 is configured to execute vector instructions which have a vector width of four elements. It should be understood that while the architecture of stream processor 300 is shown to include four elements per vector instruction, this is merely indicative of one particular embodiment. In other embodiments, stream processor 300 can include other numbers (e.g., 2, 8, 16) of elements per vector instruction. Additionally, it should be understood that the bit widths of buses within stream processor 300 can be any suitable values which can vary according to the embodiment.
In one embodiment, transcendental execution pipeline 305 and FMA execution pipeline 310 share instruction operand buffer 315. In one embodiment, instruction operand buffer 315 is coupled to a vector register file (not shown). When a vector instruction targeting transcendental execution pipeline 305 is issued, the operands for the vector instruction are read in a single cycle and stored in temporary storage (e.g., flip-flops) 330. Then, in the next cycle, the first operation of the vector instruction accesses one or more first operands from the temporary storage 330 to initiate execution of the first operation on transcendental execution pipeline 305. The FMA execution pipeline 310 can access instruction operand buffer 315 in the same cycle that the first operation is initiated on transcendental execution pipeline 305. Similarly, in subsequent cycles, additional operands are accessed from flops 330 to initiate execution of operations for the same vector instruction on transcendental execution pipeline 305. In other words, the vector instruction is converted into multiple scalar instructions which are initiated in multiple clock cycles on transcendental execution pipeline 305. Meanwhile, while multiple scalar operations are being launched on transcendental execution pipeline 305, overlapping instructions can be executed on FMA execution pipeline 310.
Different stages of the pipelines are shown for both transcendental execution pipeline 305 and FMA execution pipeline 310. For example, stage 325 involves routing operands from the multiplexors (“muxes”) 320A-B to the inputs of the respective pipelines. Stage 335 involves performing a lookup to a lookup table (LUT) for transcendental execution pipeline 305 and performing a multiply operation on multiple operands for multiple vector elements for FMA execution pipeline 310. Stage 340 involves performing multiplies for transcendental execution pipeline 305 and performing addition operations on multiple operands for multiple vector elements for FMA execution pipeline 310. Stage 345 involves performing multiplies for transcendental execution pipeline 305 and performing normalization operations for multiple vector elements for FMA execution pipeline 310. Stage 350 involves performing addition operations for transcendental execution pipeline 305 and performing rounding operations for multiple vector elements for FMA execution pipeline 310. In stage 355, the data of transcendental execution pipeline 305 passes through a normalization and leading zero detection unit, and the outputs of the rounding stage are written to the vector destination cache for FMA execution pipeline 310. In stage 360, the transcendental execution pipeline performs a rounding operation on the output from stage 355, and then the data is written to the vector destination cache. It is noted that in other embodiments, the transcendental execution pipeline 305 and/or FMA execution pipeline 310 can be structured differently.
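The stage sequences above can be summarized with the short sketch below, which simply walks an instruction through each pipeline one stage per cycle. The stage numbers are taken from the description above; the list and function names are illustrative assumptions.

```python
# Stage numbers (325-360) from the description above; one stage per cycle.
TRANSCENDENTAL_STAGES = [
    (325, "route operands from muxes"),
    (335, "lookup-table (LUT) access"),
    (340, "multiply"),
    (345, "multiply"),
    (350, "add"),
    (355, "normalize / leading-zero detect"),
    (360, "round, write vector destination cache"),
]
FMA_STAGES = [
    (325, "route operands from muxes"),
    (335, "multiply"),
    (340, "add"),
    (345, "normalize"),
    (350, "round"),
    (355, "write vector destination cache"),
]

def trace(stages, label, start_cycle=0):
    """Print which stage an instruction occupies in each cycle."""
    for offset, (num, desc) in enumerate(stages):
        print(f"cycle {start_cycle + offset}: {label} stage {num}: {desc}")

trace(TRANSCENDENTAL_STAGES, "transcendental")
trace(FMA_STAGES, "FMA")
```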
Turning now to FIG. 4, one embodiment of a timing diagram 400 of instructions executing on overlapped execution pipelines is shown.
In lane 405, which corresponds to instruction ID 0, a vector fused multiply-add (FMA) instruction is being executed on the FMA execution pipeline. Source data operands are read from the vector register file in cycle 0. Lane 410, which corresponds to instruction ID 1, illustrates the timing for a vector reciprocal instruction which is being executed on the transcendental execution pipeline. Pass 0 of the vector reciprocal instruction is initiated in cycle 1. In cycle 1, pass 0 of the vector reciprocal instruction reads all of the operands for the entire vector reciprocal instruction from the vector register file and stores them in temporary storage. It is noted that pass 0 refers to the first vector element being processed by the transcendental execution pipeline, with pass 1 referring to the second vector element being processed by the transcendental execution pipeline, and so on. In the embodiment illustrated by timing diagram 400, it is assumed that the width of the vector instructions is four elements. In other embodiments, other vector widths can be utilized.
Next, in cycle 2, a vector addition instruction is initiated on the FMA execution pipeline as shown in lane 415. Simultaneously with the vector addition instruction being initiated, in cycle 2, pass 1 of the vector reciprocal instruction is initiated as shown in lane 420. The addition instruction shown in lane 415 accesses the vector register file in cycle 2, while pass 1 of the vector reciprocal instruction accesses an operand from the temporary storage. This avoids a conflict, since the vector addition instruction and the vector reciprocal instruction do not access the vector register file in the same clock cycle. With no vector register file conflict, execution of the vector addition instruction of lane 415 is able to overlap with pass 1 of the vector reciprocal instruction shown in lane 420.
In cycle 3, the vector multiply instruction with instruction ID 3 is initiated on the FMA execution pipeline as shown in lane 425. Also in cycle 3, pass 2 of the vector reciprocal instruction is initiated on the transcendental execution pipeline as shown in lane 430. In cycle 4, the vector floor instruction with instruction ID 4 is initiated on the FMA execution pipeline as shown in lane 435. Also in cycle 4, pass 3 of the vector reciprocal instruction is initiated on the transcendental execution pipeline as shown in lane 440. In cycle 5, the vector fraction instruction with instruction ID 5 is initiated on the FMA execution pipeline as shown in lane 445. It is noted that in one embodiment, there are two write ports to the vector destination cache, allowing the transcendental execution pipeline and the FMA execution pipeline to write to the vector destination cache in the same clock cycle.
In lane 402, the timing of the allocation of cache lines in the vector destination cache is shown for the different instructions being executed on the execution pipelines. In one embodiment, cache lines are allocated early and aligned to avoid conflicts with allocations for other instructions. In cycle 4, a cache line is allocated in the vector destination cache for the FMA instruction shown in lane 405. In cycle 5, a cache line is allocated in the vector destination cache to store results for all four passes of the reciprocal instruction. In cycle 6, a cache line is allocated in the vector destination cache for the add instruction shown in lane 415. In cycle 7, a cache line is allocated in the vector destination cache for the multiply instruction shown in lane 425. In cycle 8, a cache line is allocated in the vector destination cache for the floor instruction shown in lane 435. In cycle 9, a cache line is allocated in the vector destination cache for the fraction instruction shown in lane 445. It is noted that two cache lines are not allocated in a single cycle since the cache line for the transcendental pipeline is allocated earlier during the first pass so that the allocation does not conflict with any of the instructions being executed on the FMA execution pipeline. It is also noted that multiple write ports are implemented for the vector destination cache to avoid write conflicts between the transcendental pipeline and the FMA execution pipeline.
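The schedule in timing diagram 400 can be reproduced with the short sketch below. The instruction mnemonics (v_fma, v_rcp, and so on) are assumptions for illustration; the per-cycle issue pattern and the uniform four-cycle issue-to-allocation offset follow from the cycles described above, which is why exactly one cache line is allocated per cycle from cycle 4 through cycle 9.

```python
ISSUE_SCHEDULE = [
    # (issue cycle, pipeline, event) -- mnemonics are assumed, not from the text
    (0, "FMA",   "v_fma (ID 0): reads register file, issues"),
    (1, "TRANS", "v_rcp (ID 1) pass 0: reads all four operands into temp storage"),
    (2, "FMA",   "v_add (ID 2): reads register file, issues"),
    (2, "TRANS", "v_rcp pass 1: operand from temp storage"),
    (3, "FMA",   "v_mul (ID 3): reads register file, issues"),
    (3, "TRANS", "v_rcp pass 2: operand from temp storage"),
    (4, "FMA",   "v_floor (ID 4): reads register file, issues"),
    (4, "TRANS", "v_rcp pass 3: operand from temp storage"),
    (5, "FMA",   "v_fract (ID 5): reads register file, issues"),
]

ALLOC_OFFSET = 4  # destination-cache line allocated four cycles after issue

for cycle, pipe, event in ISSUE_SCHEDULE:
    print(f"cycle {cycle}: [{pipe:5}] {event}")

# Only pass 0 of the reciprocal triggers an allocation; a single cache line
# covers all four passes. Allocations land in cycles 4-9, one per cycle.
for cycle, pipe, event in ISSUE_SCHEDULE:
    if "pass" in event and "pass 0" not in event:
        continue
    print(f"cycle {cycle + ALLOC_OFFSET}: allocate cache line for {event.split()[0]}")
```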
Referring now to FIG. 5, one embodiment of a method 500 for overlapping execution of instructions on multiple execution pipelines is shown.
A processor initiates, on a first execution pipeline, execution of a first type of instruction on a first vector element in a first clock cycle (block 505). In one embodiment, the first execution pipeline is a transcendental pipeline and the first type of instruction is a vector transcendental instruction. It is noted that “initiating execution” is defined as providing operand(s) and/or an indication of the instruction to be performed to a first stage of an execution pipeline. The first stage of the execution pipeline then starts processing the operand(s) in accordance with the functionality of the processing elements of the first stage.
Next, the processor initiates, on the first execution pipeline, execution of the first type of instruction on a second vector element in a second clock cycle, wherein the second clock cycle is subsequent to the first clock cycle (block 510). Then, the processor initiates execution, on a second execution pipeline, of a second type of instruction on a vector having a plurality of elements in the second clock cycle (block 515). In one embodiment, the second execution pipeline is a vector arithmetic logic unit (VALU) and the second type of instruction is a vector fused multiply-add (FMA) instruction. After block 515, method 500 ends.
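A compact sketch of this issue order follows. Pipeline and its start() method are hypothetical stand-ins, with start() modeling “initiating execution” as defined above, i.e., delivering operands to the pipeline's first stage.

```python
class Pipeline:
    def __init__(self, name):
        self.name = name

    def start(self, operands):
        # "Initiating execution": operands enter the pipeline's first stage.
        print(f"{self.name}: first stage receives {operands}")

def method_500(transcendental, valu, vec):
    transcendental.start(vec[0])   # block 505: first element, first cycle
    # --- next clock cycle ---
    transcendental.start(vec[1])   # block 510: second element, second cycle
    valu.start(vec)                # block 515: full vector, same second cycle

method_500(Pipeline("transcendental"), Pipeline("VALU"), [4.0, 2.0, 0.5, 8.0])
```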
Turning now to FIG. 6, another embodiment of a method 600 for overlapping execution of vector instructions is shown. A first plurality of operands for multiple vector elements of a first vector instruction are read from a vector register file in a single clock cycle and stored in temporary storage.
Then, the first plurality of operands are accessed from the temporary storage to initiate execution of multiple vector elements of the first vector instruction on a first execution pipeline in subsequent clock cycles (block 615). It is noted that the first execution pipeline does not access the vector register file during the subsequent clock cycles. Additionally, a second plurality of operands are retrieved from the vector register file during the subsequent clock cycles to initiate execution of one or more second vector instructions on the second execution pipeline (block 620). It is noted that the second execution pipeline can access the vector register file multiple times during the subsequent clock cycles to initiate multiple second vector instructions on the second execution pipeline. Since the first execution pipeline is not accessing the vector register file during the subsequent clock cycles, the second execution pipeline is able to access the vector register file to obtain operands for executing overlapping instructions. After block 620, method 600 ends.
Referring now to FIG. 7, one embodiment of a method 700 for issuing vector instructions to different types of execution pipelines is shown. A processor detects a given vector instruction in an instruction stream. If the given vector instruction is a second type of instruction (conditional block 715, “second” leg), then the processor issues the given vector instruction on a second execution pipeline (block 720). In one embodiment, the second type of instruction is a vector fused multiply-add instruction and the second execution pipeline is a vector arithmetic logic unit (VALU).
Otherwise, if the given vector instruction is a first type of instruction (conditional block 715, “first” leg), then the processor issues the given vector instruction on a first execution pipeline (block 725). In one embodiment, the first type of instruction is a vector transcendental instruction and the first execution pipeline is a transcendental execution pipeline. After blocks 720 and 725, method 700 ends. It is noted that method 700 can be performed for each vector instruction detected in the instruction stream.
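The dispatch decision of method 700 reduces to a type test, as in the sketch below. The opcode set used for classification is a hypothetical stand-in for the decode logic; the routing itself mirrors blocks 720 and 725.

```python
# Assumed opcode set standing in for the decode logic's type test.
TRANSCENDENTAL_OPS = {"v_rcp", "v_sqrt", "v_exp", "v_log", "v_sin", "v_cos"}

def issue(opcode):
    if opcode in TRANSCENDENTAL_OPS:                         # first type
        return "first (transcendental) execution pipeline"   # block 725
    return "second (VALU/FMA) execution pipeline"            # block 720

print(issue("v_rcp"))   # -> first (transcendental) execution pipeline
print(issue("v_fma"))   # -> second (VALU/FMA) execution pipeline
```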
Turning now to FIG. 8, one embodiment of a method 800 for arbitrating instruction issue between overlapped execution pipelines is shown. An instruction arbiter selects the next ready wave's instruction as a candidate for issue (block 810). When the candidate is a transcendental instruction, the instruction arbiter determines whether a previous transcendental instruction was scheduled less than four cycles ago (conditional block 825).
If a previous transcendental instruction was scheduled less than four cycles ago (conditional block 825, “yes” leg), then the instruction arbiter determines if the next ready wave's instruction is a non-transcendental instruction (conditional block 830). Otherwise, if no transcendental instruction was scheduled within the last four cycles (conditional block 825, “no” leg), then the instruction arbiter issues this transcendental instruction (block 835). After block 835, method 800 returns to block 810. If the next ready wave's instruction is a non-transcendental instruction (conditional block 830, “yes” leg), then the instruction arbiter issues this non-transcendental instruction (block 840). After block 840, method 800 returns to block 810. If the next ready wave's instruction is a transcendental instruction (conditional block 830, “no” leg), then method 800 returns to block 810.
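One possible rendering of this arbitration step is sketched below. The wave and instruction representations are assumptions; the four-cycle window corresponds to the four passes a transcendental vector instruction occupies in the embodiment above.

```python
from dataclasses import dataclass

@dataclass
class Inst:                         # illustrative instruction representation
    name: str
    is_transcendental: bool

def arbitrate(ready, cycle, last_trans_cycle):
    """One arbitration step: return (issued instruction or None, cycle of
    the most recent transcendental issue)."""
    inst = ready[0]                                   # next ready wave's inst
    if not inst.is_transcendental:
        return inst, last_trans_cycle                 # issue immediately
    recent = last_trans_cycle is not None and cycle - last_trans_cycle < 4
    if not recent:                                    # block 825, "no" leg
        return inst, cycle                            # block 835: issue it
    # Block 830: a transcendental issued < 4 cycles ago; try to slot in a
    # non-transcendental instruction from the next ready wave instead.
    nxt = ready[1] if len(ready) > 1 else None
    if nxt is not None and not nxt.is_transcendental:
        return nxt, last_trans_cycle                  # block 840
    return None, last_trans_cycle                     # retry next cycle (block 810)

# Example: a transcendental issued at cycle 1 blocks another at cycle 3,
# but an FMA from the next ready wave can issue in its place.
issued, last = arbitrate([Inst("v_sqrt", True), Inst("v_fma", False)], 3, 1)
print(issued.name)  # -> v_fma
```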
In various embodiments, program instructions of a software application are used to implement the methods and/or mechanisms previously described. The program instructions describe the behavior of hardware in a high-level programming language, such as C. Alternatively, a hardware design language (HDL) is used, such as Verilog. The program instructions are stored on a non-transitory computer-readable storage medium. Numerous types of storage media are available. The storage medium is accessible by a computing system during use to provide the program instructions and accompanying data to the computing system for program execution. The computing system includes one or more memories and one or more processors configured to execute program instructions.
It should be emphasized that the above-described embodiments are only non-limiting examples of implementations. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.