The disclosure generally relates to a microprocessor, and more specifically, to a method and a microprocessor for processing vector instructions.
Single Instruction Multiple Data (SIMD) architectures achieve high performance by executing multiple data elements designated by a SIMD instruction (also referred to as a vector instruction) in parallel, whereas a scalar instruction processes only one data element or a pair of data elements (i.e., two source operands). Each of the data elements may represent an individual piece of data (e.g., pixel data, a graphical coordinate, etc.) that is stored in a register or other storage location along with other data elements commonly having the same size. The number of data elements designated by the vector instruction varies greatly based on the data element size and the vector length multiplier (LMUL). For example, when LMUL is 1, a 512-bit wide vector data may have sixty-four 8-bit wide data elements, thirty-two 16-bit wide data elements, sixteen 32-bit wide data elements, and so on. When LMUL is 8, the 512-bit wide vector data may have five hundred and twelve 8-bit wide data elements, two hundred and fifty-six 16-bit wide data elements, one hundred and twenty-eight 32-bit wide data elements, and so on.
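For illustration, the element counts described above may be modeled by the following minimal C sketch (the helper name num_elements and its parameter names are hypothetical and not part of the disclosed hardware):

    /* Number of data elements designated by one vector instruction:
     * (vector width x LMUL) / element width,
     * e.g., (512 x 8) / 8 = 512 elements. */
    static unsigned num_elements(unsigned vlen_bits, unsigned elen_bits,
                                 unsigned lmul)
    {
        return (vlen_bits * lmul) / elen_bits;
    }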
In processing of a vector instruction, each data element of the vector register is attached with a mask bit which, when enabled, masks the corresponding data element for the designated operation. In a worst-case scenario, all 512 bits of a mask vector register would be used for five hundred and twelve 8-bit wide data elements. On the other hand, only 16 bits of a 512-bit mask vector register are needed for a 512-bit wide vector data having sixteen 32-bit wide data elements. Although not every vector instruction is a worst-case scenario, each vector instruction is still issued with 512-bit mask data to cover all possibilities of predication, regardless of whether all 512-bit mask data are required by the vector instruction or not (i.e., a brute-force implementation of mask data). Such an implementation of mask data with vector instructions takes up considerable storage area and power for the plurality of queued vector instructions in the plurality of execution pipelines of a vector processor.
The disclosure introduces a mask queue that manages the mask data of data elements corresponding to the vector instruction(s) that is issued from a decode/issue unit to an execution queue.
In the disclosure, the mask data corresponding to data elements of the issued instruction may be handled or managed by the introduced mask queue, where only the valid mask data for all vector instructions in an execution queue are stored in the mask queue. In the disclosure, mask data of multiple vector instructions may be stored in the mask queue. The corresponding mask data may be accessed from the mask queue when the vector instruction(s) is dispatched from the execution queue to the functional unit for execution. Issuing of the vector instruction from the decode/issue unit to the execution queue may be stalled if the mask queue does not have enough available entries. In some embodiments, one mask queue may be dedicated to one execution queue. In some other embodiments, one mask queue may be shared between two different execution queues. In the disclosure, resources are conserved without dedicating additional storage space for handling mask data of the vector instruction. That is, the mask queue would only read the mask data required by the vector instruction(s) from the mask register when the vector instruction(s) is issued to the execution queue. The implementation of the mask queue greatly reduces the resources required for managing the masks of data elements for processing vector instructions.
Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. It is noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
The following disclosure provides many different embodiments, or examples, for implementing different features of the present disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. For example, the formation of a first feature over or on a second feature in the description that follows may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features may be formed between the first and second features, such that the first and second features may not be in direct contact. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
The disclosure introduces a mask queue that manages the mask data of data elements corresponding to the vector instruction(s) that is issued. In a brute-force implementation of the mask operation, each instruction is issued with the entire mask register (e.g., 512 bits) to an execution queue, regardless of whether all of the mask data would be required or not. If the execution queue has 8 entries, the mask storage would be 4096 bits (i.e., 8×512). If there are 8 execution queues, the mask storage would be 32768 bits (i.e., 8×8×512). Since not all of the vector instruction(s) would use 512 bits of mask, the mask storage of the brute-force implementation is wasteful. In the disclosure, the mask queue would only read the mask data required by the vector instruction(s) from the mask register when the vector instruction(s) is issued to the execution queue. The implementation of the mask queue greatly reduces the resources required for managing the masks of data elements for processing vector instructions.
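The storage figures above follow directly from the queue dimensions, as the following minimal C sketch shows (the helper name brute_force_mask_bits is hypothetical):

    /* Brute-force mask storage: every entry of every execution queue holds
     * a full copy of the mask register.
     * One queue:  8 entries x 512 bits       = 4096 bits
     * All queues: 8 queues x 8 entries x 512 = 32768 bits */
    static unsigned brute_force_mask_bits(unsigned num_queues,
                                          unsigned entries_per_queue,
                                          unsigned mask_reg_bits)
    {
        return num_queues * entries_per_queue * mask_reg_bits;
    }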
Referring to
The microprocessor 10 may be a general-purpose processor (e.g., a central processing unit) or a special-purpose processor (e.g., a network processor, a communication processor, DSPs, an embedded processor, etc.). The processor may have any of the instruction set architectures such as Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), Very Long Instruction Word (VLIW), hybrids thereof, or other types of instruction set architectures. In some of the embodiments, the microprocessor is a RISC processor that performs predication or masking on vector instructions. The microprocessor implements instruction-level parallelism within a single microprocessor and achieves high performance by executing multiple instructions per clock cycle. Multiple instructions are dispatched to different functional units for parallel execution. The superscalar microprocessor may employ out-of-order (OOO) execution, in which a second instruction without any dependency on a first instruction may be executed prior to the first instruction. In traditional out-of-order microprocessor designs, the instructions can be executed out-of-order but they must retire to a register file of the microprocessor in-order because of control hazards such as branch misprediction, interrupt, and precise exception. Temporary storage such as a re-order buffer and register renaming is used for the result data until the instruction is retired in-order from the execution pipeline. The microprocessor 10 may execute and retire instructions out-of-order by writing back result data out-of-order to the register file as long as the instruction has no data dependency and no control hazard.
Referring to
In some embodiments, the instruction cache 11 is coupled (not shown) to the memory 30 and the decode/issue unit 13, and is configured to store instructions that are fetched from the memory 30 and dispatch the instructions to the decode/issue unit 13. The instruction cache 11 includes many cache lines of contiguous instruction bytes from the memory 30. The cache lines are organized as direct mapping, fully associative mapping, set-associative mapping, and the like. The direct mapping, the fully associative mapping, and the set-associative mapping are well-known in the relevant art, thus the detailed description of the above mappings is omitted.
The instruction cache 11 may include a tag array (not shown) and a data array (not shown) for respectively storing a portion of the address and the data of frequently-used instructions that are used by the microprocessor 10. Each tag in the tag array corresponds to a cache line in the data array. When the microprocessor 10 needs to execute an instruction, the microprocessor 10 first checks for the existence of the instruction in the instruction cache 11 by comparing the address of the instruction to the tags stored in the tag array. If the instruction address matches one of the tags in the tag array (i.e., a cache hit), then the corresponding cache line is fetched from the data array. If the instruction address does not match any entry in the tag array (i.e., a cache miss), the microprocessor 10 may access the memory 30 to find the instruction. In some embodiments, the microprocessor 10 further includes an instruction queue (not shown) that is coupled to the instruction cache 11 and the decode/issue unit 13 for storing the instructions from the instruction cache 11 or the memory 30 before sending the instructions to the decode/issue unit 13.
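As an illustrative sketch of the tag comparison described above, assuming a direct-mapped organization with power-of-two dimensions (the type and function names here are hypothetical, not the disclosed design):

    /* A lookup is a hit when the indexed line is valid and its stored
     * tag matches the tag bits of the instruction address. */
    typedef struct { unsigned tag; int valid; } tag_entry_t;

    static int icache_hit(const tag_entry_t *tag_array, unsigned num_lines,
                          unsigned line_bytes, unsigned addr)
    {
        unsigned index = (addr / line_bytes) % num_lines; /* select a line */
        unsigned tag   = addr / (line_bytes * num_lines); /* upper bits    */
        return tag_array[index].valid && tag_array[index].tag == tag;
    }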
The BPU 12 is coupled to the instruction cache 11 and is configured to speculatively fetch instructions subsequent to branch instructions. The BPU 12 may provide a prediction of the branch direction (taken or not taken) of branch instructions based on the past behaviors of the branch instructions and provide the predicted branch target addresses of the taken branch instructions. The branch direction may be "taken", in which subsequent instructions are fetched from the branch target address of the taken branch instruction. The branch direction may be "not taken", in which subsequent instructions are fetched from memory locations consecutive to the branch instruction. In some embodiments, the BPU 12 implements basic block branch prediction for predicting the end of a basic block from the starting address of the basic block. The starting address of the basic block (e.g., the address of the first instruction of the basic block) may be the target address of a previously taken branch instruction. The ending address of the basic block is the instruction address after the last instruction of the basic block, which may be the starting address of another basic block. The basic block may include a number of instructions, and the basic block ends when a branch in the basic block is taken to jump to another basic block.
The decode/issue unit 13 may decode the instructions received from the instruction cache 11. The instruction may include the following fields: an operation code (or opcode), operands (e.g., source operands and destination operands), and immediate data. The opcode may specify which operation (e.g., ADD, SUBTRACT, SHIFT, STORE, LOAD, etc.) to carry out.
The operand may specify the index or address of a register in the register file 14, where the source operand indicates a register in the register file from which the operation would read, and the destination operand indicates a register in the register file to which a result data of the operation would be written back. It should be noted that the source operand and the destination operand may also be referred to as the source register and the destination register, which may be used interchangeably hereinafter. In the embodiment, the operand would need a 5-bit index to identify a register in a register file that has 32 registers. Some instructions may use the immediate data as specified in the instruction instead of the register data. Each instruction would be executed in a functional unit 20 or the load/store unit 17. Based on the type of operation specified by the opcode and the availability of the resources (e.g., register, functional unit, etc.), each instruction would have an execution latency time and a throughput time. The execution latency time (or latency time) refers to the amount of time (i.e., the number of clock cycles) for the execution of the operation specified by the instruction(s) to complete and write back the result data. The throughput time refers to the amount of time (i.e., the number of clock cycles) before the next instruction can enter the functional unit 20.
In the embodiments, instructions are decoded in the decode/issue unit 13 to obtain the execution latency time, the throughput time, and the instruction type based on the opcode. Multiple instructions may be issued to one execution queue 19, where the throughput times of the multiple instructions are accumulated. The accumulated throughput time indicates when the next instruction to be issued can enter the functional unit 20 for execution (i.e., the amount of time an instruction must wait before entering the functional unit 20) in view of the previously issued instruction(s) in the execution queue 19. The time defining when an instruction to be issued can be sent to the functional unit 20 is referred to as the read time (from the register file), and the time defining when the instruction is completed by the functional unit 20 is referred to as the write time (to the register file). The instructions are issued to the execution queues 19, where each issued instruction has a scheduled read time to dispatch to the corresponding functional unit 20 or load/store unit 17 for execution. At the issue of an instruction, the accumulated throughput time of the issued instruction(s) in the execution queue 19 is the read time of the instruction to be issued. The execution latency time of the instruction to be issued is added to the accumulated throughput time to generate the write time when the instruction is issued to the next available entry of the execution queue 19. The modified execution latency time is referred to herein as the write time of the most recently issued instruction, and the modified start time is referred to herein as the read time of the next instruction to be issued. The write time and the read time may also be collectively referred to as an access time, which describes a particular time point for the issued instruction to write to or read from a register of the register file 14. Since the source register(s) is scheduled to read from the register file 14 just in time for execution by the functional unit 20, no temporary register is needed in the execution queue for the source register(s), which is an advantage in comparison to other microprocessors in one of the embodiments. Since the destination register is scheduled to write back to the register file 14 from the functional unit 20 or the data cache 18 at the exact time in the future, no temporary register is needed to store the result data if there are conflicts with other functional units 20 or the data cache 18 in one of the embodiments, which is an advantage in comparison to other microprocessors. For parallel issuing of more than one instruction, the write time and the read time of a second instruction may be further adjusted based on a first instruction which was issued prior to the second instruction.
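The read/write-time bookkeeping described above may be illustrated with a minimal C sketch (the names exec_queue_t and schedule_times are hypothetical; acc_cnt follows the accumulate counter described later in this disclosure):

    /* At issue time: read time = accumulated throughput of instructions
     * already in the queue; write time = read time + execution latency.
     * The accumulate counter then reserves the functional unit for the
     * throughput time of the newly issued instruction. */
    typedef struct {
        unsigned acc_cnt; /* accumulated throughput, counts down each cycle */
    } exec_queue_t;

    static void schedule_times(exec_queue_t *q, unsigned latency,
                               unsigned throughput,
                               unsigned *read_time, unsigned *write_time)
    {
        *read_time  = q->acc_cnt;           /* when source registers are read */
        *write_time = q->acc_cnt + latency; /* when the result writes back    */
        q->acc_cnt += throughput;           /* acc_cnt = acc_cnt + xput time  */
    }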
For vector processing, the decode/issue unit 13 reads mask data from the mask vector register v(0) of the register file 14 and attaches the mask data with the vector instruction to the execution queue 19. Each execution queue 19 includes a mask queue 21 to keep the mask data for each issued vector instruction in the execution queue 19. When the instruction is dispatched from the execution queue 19 to the functional unit 20, the mask data is read (if the mask operation is enabled) from the mask queue 21 and sent with the instruction to the functional unit 20.
In the embodiments, the decode/issue unit 13 is configured to check and resolve all possible conflicts before issuing the instruction. An instruction may have the following 4 basic types of conflicts: (1) data dependency, which includes write-after-read (WAR), read-after-write (RAW), and write-after-write (WAW) dependencies, (2) availability of a read port to read data from the register file to the functional unit, (3) availability of a write port to write back data from the functional unit to the register file, and (4) availability of the functional unit 20 to execute the instruction. The decode/issue unit 13 may access the scoreboard 15 to check data dependency before the instruction can be issued to the execution queue 19. Furthermore, the register file 14 has a limited number of read and write ports, and the issued instructions must arbitrate or reserve the read and write ports to access the register file 14 at future times. The decode/issue unit 13 may access the read/write control unit 16 to check the availability of the read ports and write ports of the register file 14, so as to schedule the access time (i.e., read and write times) of the instruction. In other embodiments, one of the write ports may be dedicated for instructions with unknown write time to write back to the register file 14 without using the write port control, and one of the read ports may be reserved for instructions with unknown read time to read data from the register file 14 without using the read port control. The read ports of the register file 14 can be dynamically reserved (not dedicated) for the read operations having unknown access time. In this case, the functional unit 20 must ensure that the read port is not busy when trying to read data from the register file 14. In the embodiments, the availability of the functional unit 20 may be resolved by coordinating with the execution queue 19, where the throughput times of queued instructions (i.e., previously issued to the execution queue) are accumulated. Based on the accumulated throughput time in the execution queue, the instruction may be issued to the execution queue 19, where the instruction may be scheduled to be dispatched to the functional unit 20 at a specific time in the future at which the functional unit 20 is available.
The unknown field 1511 includes a bit value that indicates whether the write time of a register corresponding to the scoreboard entry is known or unknown. For example, the unknown field 1511 may include one bit or any number of bits, where a non-zero value indicates that the register has an unknown write time, and a zero value indicates that the register has a known write time as indicated by the write count field 1513. The unknown field 1511 may be set or modified at the issue time of an instruction and reset after the unknown register write time is resolved. For example, the reset operation may be performed by the decode/issue unit 13, a load/store unit 17 (e.g., after a data hit), a functional unit 20 (e.g., after an INT DIV operation resolves the number of digits to divide), or other units in the microprocessor that involve execution of instructions with unknown write time. The write count field 1513 records a write count value that indicates the number of clock cycles before the register can be accessed by the next instruction (that is to be issued). In other words, the write count field 1513 records the number of clock cycles in which the previously issued instruction(s) would complete the operation and write back the result data to the register. The write count value of the write count field 1513 is set based on the write time (which may also be referred to as the execution latency time) of an instruction at the issue time of the instruction. Then, the write count value counts down (i.e., decrements by one) for every clock cycle until the count value becomes zero (i.e., a self-reset counter). The functional unit field 1515 of the scoreboard entry specifies a functional unit 20 (designated by the issued instruction) that is to write back to the register.
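A minimal C sketch of one scoreboard entry and its per-cycle behavior, assuming the three fields described above (the type and function names are hypothetical):

    /* One scoreboard entry per register of the register file. */
    typedef struct {
        unsigned unknown; /* field 1511: non-zero if write time is unknown   */
        unsigned wr_cnt;  /* field 1513: cycles until the register is written */
        unsigned funit;   /* field 1515: functional unit that writes back    */
    } sb_entry_t;

    /* Self-reset counter: the write count decrements every clock cycle
     * until it reaches zero; an unknown write time does not count down. */
    static void sb_tick(sb_entry_t *e)
    {
        if (!e->unknown && e->wr_cnt > 0)
            e->wr_cnt--;
    }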
With reference to
With reference to
The vector execution queues 19 are configured to hold issued vector instructions which are scheduled to be dispatched to the functional units 20. In the embodiments, each vector execution queue 19 includes a mask queue 21 that stores mask data corresponding to the vector instructions issued to the execution queue 19. With reference to
The valid field 191 indicates whether an entry is valid or not (e.g., a valid entry is indicated by "1" and an invalid entry is indicated by "0"). The execution control data field 193 indicates execution control information for the corresponding functional unit 20, which is derived from the received vector instruction. The address field 195 records the address of the register that the vector instruction accesses. The throughput count field 197 records a throughput count value that represents the number of clock cycles for the functional unit 20 to accept the vector instruction corresponding to the execution queue entry. In other words, the functional unit 20 would be free to accept the vector instruction in the vector execution queue 19 after the number of clock cycles specified in the throughput count field 197 expires. The throughput count value is counted down by one for every clock cycle until the throughput count value reaches zero. When the throughput count value reaches 0, the execution queue 19 dispatches the vector instruction in the corresponding execution queue entry to the functional unit 20. The micro-op field 198 records a micro-op count value representing the number of micro-operations that are specified by the vector instruction of the execution queue entry. The micro-op count value decrements by one for every dispatching of one micro-op until the micro-op count value reaches 0. The corresponding execution queue entry can only be invalidated, and processing of the subsequent execution queue entry can only start, when the micro-op count value and the throughput count value of the current execution queue entry are 0.
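One possible cycle-level interpretation of the two counters is the following C sketch (the names eq_entry_t and eq_tick are hypothetical; the field numbers refer to the description above):

    typedef struct {
        int      valid;    /* field 191: entry holds a vector instruction  */
        unsigned xput_cnt; /* field 197: cycles until the FU accepts an op */
        unsigned mop_cnt;  /* field 198: micro-ops remaining to dispatch   */
    } eq_entry_t;

    /* One clock cycle for the head entry of the execution queue. */
    static void eq_tick(eq_entry_t *e, unsigned throughput)
    {
        if (!e->valid) return;
        if (e->xput_cnt > 0) { e->xput_cnt--; return; }
        e->mop_cnt--;                 /* throughput count 0: dispatch one micro-op */
        if (e->mop_cnt == 0)
            e->valid = 0;             /* both counts 0: invalidate the entry */
        else
            e->xput_cnt = throughput; /* wait for the FU to accept the next one */
    }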
The execution queue 19 may include or be coupled to an accumulate counter 199 for storing an accumulate count value acc_cnt that is counted down by one for every clock cycle until the counter value becomes zero. An accumulate count value of zero indicates that the execution queue 19 is empty. The accumulate count value acc_cnt of the accumulate counter 199 indicates a time (i.e., the number of clock cycles) in the future at which the next instruction in the decode/issue unit 13 can be dispatched to the functional units 20 or the load/store unit 17 via the execution queue 19. In some embodiments, the read time of the instruction is the accumulate count value, and the accumulate count value is set according to the sum of the current acc_cnt and the instruction throughput time (acc_cnt = acc_cnt + inst_xput_time) for the next instruction. In some other embodiments, the read time may be modified, and the accumulate count value acc_cnt is set according to a sum of a read time (rd_cnt) of the instruction and a throughput time of the instruction (acc_cnt = rd_cnt + inst_xput_time) for the next instruction. In some embodiments, the read shifters 161 and the write shifters 163 are designed to be synchronized with the execution queue 19. For example, the execution queue 19 may dispatch the instruction to the functional unit 20 or the load/store unit 17 at the same time as the source registers are read from the register file 14 according to the read shifters 161, and the result data from the functional unit 20 or the load/store unit 17 are written back to the register file 14 according to the write shifters 163.
For example, two execution queue entries 190(0), 190(1) are valid and respectively record a first instruction and a second instruction issued after the first instruction. The first instruction in the execution queue entry 190(0) has a throughput time of 5 clock cycles as recorded in the throughput count field 197 and a micro-op count of 4 as recorded in the mop_cnt field 198. In the example, one micro-op of the first instruction would be sent to the functional unit 20 every 5 clock cycles until the micro-op count reaches 0. The total execution throughput time of the first instruction in the first execution queue entry 190(0) would be 20 clock cycles (i.e., 5 clock cycles × 4 micro-operations). Similarly, the total execution throughput time for the second instruction in the second execution queue entry 190(1) would be 16 clock cycles, since there are 8 micro-ops and each has an execution throughput time of 2 clock cycles. The accumulate throughput counter 199 would be set to 36 clock cycles, which would be used for issuing a third instruction to the next available execution queue entry (i.e., a third execution queue entry 190(2)).
With reference to
The data cache 18 is coupled to the register file 14, the memory 30, and the load/store unit 17 and is configured to temporarily store data that are fetched from the memory 30. The load/store unit 17 accesses the data cache 18 for load data or store data. The data cache 18 includes many cache lines of contiguous data bytes from the memory 30. The cache lines of the data cache 18 are organized as direct mapping, fully associative mapping, or set-associative mapping similar to the instruction cache 11, but not necessarily with the same mapping as the instruction cache 11. The data cache 18 may include a tag array (TA) 22 and a data array (DA) 24 for respectively storing a portion of the address and the data frequently used by the microprocessor 10. Each tag in the tag array 22 corresponds to a cache line in the data array 24. When the microprocessor 10 needs to execute the load/store instruction, the microprocessor 10 first checks for the existence of the load/store data in the data cache 18 by comparing the load/store address to the tags stored in the tag array 22. The TEQ 19A dispatches the tag operation to an address generation unit (AGU) 171 of the load/store unit 17 to calculate a load/store address. The load/store address is used to access the tag array (TA) 22 of the data cache 18. If the load/store address matches one of the tags in the tag array (cache hit), then the corresponding cache line in the data array 24 is accessed for load/store data. If the load/store address does not match any entry in the tag array 22 (cache miss), the microprocessor 10 may access the memory 30 to find the data. In case of a cache hit, the execution latency of the load/store instruction is known. In case of a cache miss, the execution latency of the load/store instruction is unknown. In some embodiments, the load/store instruction may be issued based on the known execution latency of an assumed cache hit, which may be a predetermined count value (e.g., 2, 3, 6, or any number of clock cycles). When a cache miss is encountered, the issuing of the load/store instruction may configure the scoreboard 15 to indicate that a corresponding register has a data dependency with unknown execution latency time.
In the following, a process of issuing an instruction with known access time by using the scoreboard 15, the accumulated throughput time of the instructions in the execution queue 19, and the read/write control unit 16 will be explained.
When the decode/issue unit 13 receives an instruction from the instruction cache 11, the decode/issue unit 13 accesses the scoreboard 15 to check for any data dependencies before issuing the instruction. Specifically, the unknown field and the count field of the scoreboard entry corresponding to the register are checked to determine whether the previously issued instruction has a known access time. In some embodiments, the current accumulated count value of the accumulate counter 199 may also be accessed to check the availability of the functional unit 20. If a previously issued instruction (i.e., a first instruction) and the received instruction (i.e., a second instruction) which is to be issued are to access the same register, the second instruction may have a data dependency. The second instruction is received and to be issued after the first instruction. Generally, data dependency can be classified into a write-after-write (WAW) dependency, a read-after-write (RAW) dependency, and a write-after-read (WAR) dependency. The WAW dependency refers to a situation where the second instruction must wait for the first instruction to write back the result data to a register before the second instruction can write to the same register. The RAW dependency refers to a situation where the second instruction must wait for the first instruction to write back to a register before the second instruction can read data from the same register. The WAR dependency refers to a situation where the second instruction must wait for the first instruction to read data from a register before the second instruction can write to the same register. With the scoreboard 15 and the execution queue 19 described above, instructions with known access time may be issued and scheduled to a future time to avoid these data dependencies.
In an embodiment of handling RAW data dependency, if the count value of the count field 153 is equal to or less than the read time of the instruction to be issued (i.e., the inst read time), then there is no RAW dependency, and the decode/issue unit 13 may issue the instruction. If the count value of the count field 153 is greater than the sum of the instruction read time and 1 (e.g., inst read time + 1), there is RAW data dependency, and the decode/issue unit 13 may stall the issue of the instruction. If the count value of the count field 153 is equal to the sum of the instruction read time and 1 (e.g., inst read time + 1), the result data may be forwarded from the functional unit recorded in the functional unit field 155. In such a case, the instruction with RAW data dependency can still be issued. The functional unit field 155 may be used for forwarding of result data from the recorded functional unit to a functional unit of the instruction to be issued. In an embodiment of handling a WAW data dependency, if the count value of the count field 153 is greater than or equal to the write time of the instruction to be issued, then there is WAW data dependency, and the decode/issue unit 13 may stall the issuing of the instruction. In an embodiment of handling a WAR data dependency, if the count value of the count field 153 (which records the read time of the previously issued instruction) is greater than the write time of the instruction, then there is WAR data dependency, and the decode/issue unit 13 may stall the issue of the instruction. If the count value of the count field 153 is less than or equal to the write time of the instruction, then there is no WAR data dependency, and the decode/issue unit 13 may issue the instruction.
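The comparisons above may be summarized in the following C sketch (the enum and function names are hypothetical; cnt stands for the count value of the count field 153 of the accessed register's scoreboard entry):

    typedef enum { ISSUE, STALL, ISSUE_WITH_FORWARD } issue_decision_t;

    /* RAW: stall unless the register is written early enough, or exactly
     * one cycle before the read, in which case the result is forwarded. */
    static issue_decision_t check_raw(unsigned cnt, unsigned inst_read_time)
    {
        if (cnt <= inst_read_time)     return ISSUE;              /* no RAW  */
        if (cnt == inst_read_time + 1) return ISSUE_WITH_FORWARD; /* forward */
        return STALL;                                             /* RAW     */
    }

    /* WAW: stall if the pending write completes at or after our write. */
    static issue_decision_t check_waw(unsigned cnt, unsigned inst_write_time)
    {
        return (cnt >= inst_write_time) ? STALL : ISSUE;
    }

    /* WAR: stall if the pending read happens after our write. */
    static issue_decision_t check_war(unsigned read_cnt, unsigned inst_write_time)
    {
        return (read_cnt > inst_write_time) ? STALL : ISSUE;
    }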
Based on the count value in the count field of the scoreboard 15, the decode/issue unit 13 may anticipate the availability of the registers and schedule the execution of instructions to the execution queue 19, where the execution queue 19 may dispatch the queued instruction(s) to the functional unit 20 in the order in which the queued instruction(s) is received from the decode/issue unit 13. The execution queue 19 may accumulate the throughput time of the queued instructions in the execution queue 19 to anticipate the next free clock cycle at which the execution queue 19 is available for executing the next instruction. The decode/issue unit 13 may also synchronize the read ports and write ports of the register file by accessing the read/write control unit 16 to check the availability of the read ports and write ports of the register file 14 before issuing the instruction. For example, the accumulated throughput time of the first instruction(s) in the execution queue 19 indicates that the functional unit 20 would be occupied by the first instruction(s) for 11 clock cycles. If the write time of the second instruction is 12 clock cycles, then the result data will be written back from the functional unit 20 to the register file 14 at time 23 (or the 23rd clock cycle from now) in the future. In other words, the decode/issue unit 13 would ensure the availability of the register and the read port at the 11th clock cycle and the availability of the write port for the writeback operation at the 23rd clock cycle at the issue time of the second instruction. If the read port or write port is busy in the corresponding clock cycles, the decode/issue unit 13 may stall for one clock cycle and check the availabilities of the register and the read/write ports again.
The mask queue 21 handles the mask data of the vector instruction issued to the execution queue 19.
The mask operation may be represented as a predicate operand, a conditional control operand, or a conditional vector operation control operand. In the embodiments, the mask operation may be enabled based on bit 25 of the vector instruction. In other words, bit 25 of the vector instruction indicates whether the vector instruction is a masked vector instruction or an unmasked vector instruction. Other bits in the vector instruction may be used for enabling the mask operation; the disclosure is not limited thereto. The mask data of the mask queue 21 may be used to predicate, conditionally control, or mask whether or not individual results of the operations are to be stored as the data elements of the destination operand and/or whether or not operations associated with the vector instruction are to be performed on the data elements of the source operand. Typically, one mask bit would be attached to each data element of the vector instruction. The number of mask data varies based on the vector data length (VLEN), the data element width (ELEN), and the vector length multiplier (LMUL). The vector length multiplier represents the number of vector registers that are combined to form a vector register group. The value of the vector length multiplier may be 1, 2, 4, 8, and so on. The number of the data elements may be calculated by dividing the vector data length by the data element width (VLEN/ELEN), and each data element would require a mask bit when the mask operation is enabled. With the vector length multiplier, one single vector instruction may include various numbers of micro-ops, and each of the micro-ops also requires mask data to perform the mask operation. In a case of the vector length multiplier being 8, i.e., LMUL=8, the number of data elements for one single instruction increases by 8 times as compared to LMUL=1 (i.e., (VLEN×LMUL)/ELEN). In a case of 512-bit wide vector data and 8 micro-ops, the number of data elements for one single vector instruction may be as large as 512 elements when the data element width is 8 bits (ELEN=8). In such a case, 512-bit mask data are required, which may be referred to as a worst-case scenario of the 512-bit mask register. On the other hand, a vector instruction having a data element width of 32 bits and 1 micro-op (LMUL=1) would only require 16-bit mask data, which may be referred to as a best-case scenario for the mask register. In the brute-force implementation, each entry of the execution queue would be equipped with 512 bits for handling the possibility of 512-bit mask data, regardless of whether the vector instruction in the execution queue requires all of the 512 bits or not. If each of 8 execution queues has 8 entries, a total of 32,768 bits (8×8×512) would be required to handle masks in the worst-case scenario for every entry in every execution queue. This is excessive storage for mask data. In the embodiments, the mask queue is dedicated to handling the masks of vector data for all of the queue entries of the execution queue instead of reserving 512 bits in every queue entry for handling masks.
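Under the 512-bit mask register and the 16-bit mask entries described below, the required mask storage per instruction may be sketched as follows (the helper names are hypothetical):

    /* One mask bit per data element: (VLEN x LMUL) / ELEN mask bits.
     * Worst case: (512 x 8) / 8  = 512 bits -> 32 mask entries.
     * Best case:  (512 x 1) / 32 = 16 bits  -> 1 mask entry.     */
    static unsigned mask_bits(unsigned vlen, unsigned elen, unsigned lmul)
    {
        return (vlen * lmul) / elen;
    }

    static unsigned mask_entries(unsigned vlen, unsigned elen, unsigned lmul)
    {
        return mask_bits(vlen, elen, lmul) / 16; /* 16-bit wide mask entries */
    }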
In the embodiments of the 512-bit wide mask register v(0), the 32 mask entries of the mask queue 21 would have the capability to handle mask data for 32 vector instructions having 32-bit wide data elements when LMUL is 1, where only the first 16 bits of the mask register are mask data for the vector registers (i.e., the best-case scenario). When LMUL is 8, the mask queue 21 has the capability to handle 4 vector instructions having 32-bit wide data elements, where the first 128 bits of the mask register are mask data for the vector registers. For 16-bit wide data elements, the mask queue 21 has the capability to handle mask data for 16 vector instructions when LMUL is 1 (i.e., 32 bits of mask data each) and 2 vector instructions when LMUL is 8 (i.e., 256 bits of mask data each). For 8-bit wide data elements, the mask queue 21 has the capability to handle mask data for 8 vector instructions when LMUL is 1 (i.e., 64 bits of mask data each) and 1 vector instruction when LMUL is 8 (i.e., 512 bits of mask data).
It should be noted that the 512-bit wide mask register is utilized to show the concept of the invention; mask registers having different widths such as 32, 64, 128, or 1024 bits may also be adapted to handle the mask data of the mask operation. For example, in an embodiment of a 1024-bit wide mask register, the microprocessor may be equipped with a 1024-bit wide mask queue to handle the worst-case scenario (e.g., VLEN=1024, ELEN=8, LMUL=8). Furthermore, the width of the data element and the value of the vector length multiplier (LMUL) may also vary without departing from the scope of the invention. The same algorithm of the mask queue as described in the specification may also be adapted to handle data elements having a width of 64 bits, 128 bits, etc.
In the embodiments, the mask queue 21 may be accessed by rotating pointers such as a write pointer (“wrptr”) 211 and a read pointer (“rdptr”) 213. The write pointer 211 is incremented per allocation of one vector instruction in the execution queue. The read pointer 213 is incremented per completion of one vector instruction. The mask data are written to the mask queue 21 as one entry of R bits (e.g., R=512 bits) and read from the mask queue 21 as M entries (e.g., M=32 entries).
In a write operation of the mask queue 21, the entire width of the mask register v(0) may be written to the mask queue 21 as one entry when a vector instruction is issued to the execution queue 19. That is, 512 bits (i.e., the total width of the mask register v(0)) may be written to the mask queue 21 starting from a mask entry specified by a position of the write pointer 211. To be specific, up to 32 entries of the mask queue 21 may be enabled for writing of the 512-bit mask data starting from the write pointer 211. The relocation of the write pointer 211 may be calculated based on the number of mask data required by the vector data of the issued vector instruction. For example, a first vector instruction has 2 micro-ops (i.e., LMUL=2), and the vector data has a data element width (ELEN) of 16 bits. The vector data would have 64 data elements, which requires the first 64 bits of the mask register v(0) for the mask operation. The relocation of the write pointer 211 would be calculated based on the 64-bit (4×16) mask data required by the first vector instruction. To be specific, 4 entries of the mask queue 21 are enabled for writing of the 64-bit mask data starting from the write pointer 211. The mask queue 21 is written as a single entry of 512-bit mask data, but only 4 mask entries are enabled for writing of the 64-bit mask data while the remaining 448-bit mask data are blocked from writing into the mask queue 21. Each mask entry may be assigned a write enable bit (not shown) for indicating whether a corresponding mask entry is enabled for the write operation or not. In the example, the write pointer 211 would be incremented by 4 entries. If the first vector instruction is issued to the first queue entry 190(0) of the execution queue 19, the write operation would start from the first mask entry 210(0) and the write pointer 211 would be incremented from the first mask entry 210(0) to the fifth mask entry 210(4). When a second vector instruction is issued to the second queue entry 190(1), the entire width of the mask register v(0) would be used to write to the mask queue 21 as one entry. The write operation of the mask queue 21 for the mask data of the second vector instruction would start from the new position of the write pointer, i.e., the fifth mask entry 210(4). If the second vector instruction is the worst-case scenario that requires the entire width (e.g., 512 bits when VLEN=512, ELEN=8 bits, and LMUL=8) of the mask queue 21, the issuing of the second vector instruction would be stalled in the decode/issue unit until the first vector instruction is dispatched to the functional unit 20. In another scenario, if the first vector instruction is the worst-case scenario that requires the entire width of the mask queue 21 for the mask operation, the issuing of the second vector instruction subsequent to the first vector instruction would be stalled until the first vector instruction is dispatched to the functional unit 20. However, in the alternative embodiments of the mask queue 21 not being dependent on the width of the mask register, the size of the mask queue 21 may handle more mask bits in addition to the 512-bit mask data in the worst-case scenario of the 512-bit wide vector register. In such embodiments, the second instruction may still be issued after the first vector instruction that has the worst-case scenario as long as the number of available mask entries in the mask queue 21 is sufficient to handle the mask data corresponding to the second vector instruction.
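A minimal C sketch of this write operation, assuming a 32-entry queue of 16-bit mask entries (the mask_queue_t type and the mq_write name are hypothetical):

    #define MQ_ENTRIES 32 /* 512-bit mask register / 16-bit mask entries */

    typedef struct {
        unsigned short entry[MQ_ENTRIES]; /* 16-bit wide mask entries */
        unsigned wrptr;                   /* write pointer 211 */
        unsigned rdptr;                   /* read pointer 213 */
    } mask_queue_t;

    /* All 512 bits of v(0) are presented as one write, but only the entries
     * covering the instruction's mask bits are write-enabled; the write
     * pointer then advances by that entry count (e.g., 4 entries for
     * LMUL=2 and ELEN=16). */
    static void mq_write(mask_queue_t *q,
                         const unsigned short v0[MQ_ENTRIES],
                         unsigned entries_needed)
    {
        for (unsigned i = 0; i < entries_needed; i++)
            q->entry[(q->wrptr + i) % MQ_ENTRIES] = v0[i];
        q->wrptr = (q->wrptr + entries_needed) % MQ_ENTRIES;
    }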
In any case, the vector instruction may be stalled in the decode/issue unit 13 until the number of available mask entries in the mask queue 21 is enough to hold the new mask data for the vector instruction.
The read operation of the mask queue 21 starts from the mask entry that is pointed to by the read pointer 213, which increments when the corresponding vector instruction in the execution queue 19 is dispatched to the functional unit 20. The vector instruction may have many micro-ops as indicated by the micro-op count field 198, where the micro-op count field 198 is decremented by 1 every time a micro-op is dispatched to the functional unit 20. All micro-ops of the vector instruction have been dispatched to the functional unit 20 when the micro-op count field 198 reaches a count value of 0, in which case the valid field 191 of the entry of the execution queue 19 is reset and the read pointer of the mask queue 21 can be incremented. The read pointer 213 points to one of the mask entries corresponding to the first micro-op of the vector instruction (referred to as a current read mask entry). The read operation may read X consecutive mask entries starting from the current read mask entry, where X may be an integer greater than 0. The current read mask entry may be offset by the order of micro-ops to read the corresponding mask data stored in the mask queue 21. For 8-bit elements, the number of mask bits required for a vector operation is 64-bit mask data, or 4 mask entries of the mask queue 21. For 16-bit elements, the number of mask bits required for a vector operation is 32-bit mask data, or 2 mask entries of the mask queue 21. For 32-bit elements, the number of mask bits required for a vector operation is 16-bit mask data, or 1 mask entry of the mask queue 21. The number of mask entries for each micro-op is referred to herein as the micro-op mask size, i.e., 4 mask entries for 8-bit elements, 2 mask entries for 16-bit elements, and 1 mask entry for 32-bit elements. In the embodiments, instead of calculating the exact number of mask entries to read for different element lengths (8-bit, 16-bit, and 32-bit elements), four mask entries (i.e., X=4) are read each time. In the case of 8-bit wide data elements, all 4 entries are used for each micro-op. In the case of 16-bit wide data elements, the first 2 entries are used for each micro-op. In the case of 32-bit wide data elements, the first entry is used for each micro-op. Therefore, the read operation of the mask queue 21 is configured to read at least 64 bits, i.e., four 16-bit wide mask entries, to handle each micro-op that may have various widths of mask data due to the data element width.
As described above, the current read mask entry may be offset by the order of the micro-ops. If a vector instruction has three micro-ops, the first micro-op would read 4 consecutive mask entries starting from the mask entry pointed to by the read pointer 213. The second micro-op would read 4 consecutive mask entries starting from a modified read pointer. In the embodiments, the read pointer is modified by adding the micro-op mask size (which depends on the width of the data elements) to the read pointer 213. The third micro-op would read 4 consecutive mask entries starting from the read pointer modified by adding 2 micro-op mask sizes to the read pointer 213. In the case of 32-bit wide data elements, which have a micro-op mask size of one mask entry (i.e., 16 mask bits), the read pointer 213 points to the mask entry 210(0) as the current read mask entry. The first micro-op reads the four consecutive mask entries 210(0)-210(3) starting from the mask entry 210(0). The second micro-op reads the mask entries 210(1)-210(4) starting from the mask entry 210(1), where the mask entry pointed to by the read pointer 213 is modified by adding 1 micro-op mask size to the position of the read pointer 213. The third micro-op reads the mask entries 210(2)-210(5) starting from the mask entry 210(2), where the mask entry pointed to by the read pointer 213 is modified by adding 2 micro-op mask sizes to the position of the read pointer 213, and so on. The number of micro-op mask sizes to be applied depends on the order of the micro-ops of the vector instruction. In the embodiments, 64-bit mask data may be read from the mask queue 21. However, the mask data required by the micro-op of the vector instruction may vary based on the width of the data element. In the case of 16-bit wide data elements, only 32-bit mask data is needed for a micro-op of 512-bit wide vector data length, and the read operation of mask data would increment by a factor of a micro-op mask size of 2 mask entries. The micro-op would use the first 32 bits of the 64-bit read mask data (i.e., read mask data [31:0]) and ignore the last 32 bits of the 64-bit read mask data (i.e., read mask data [63:32]). It should be noted that the read operation of four consecutive mask entries is not intended to limit the disclosure. Read operations of the mask queue 21 involving various numbers of mask entries such as 1, 2, 4, 8, 16, and so on may be implemented without departing from the scope of the disclosure. In an alternative embodiment of 1024-bit wide vector data, the execution queue 19 may be configured to read eight consecutive mask entries (i.e., 128-bit mask data).
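Continuing the hypothetical mask_queue_t sketch above, the read for micro-op number uop_idx may be modeled as:

    /* Always read 4 consecutive mask entries (64 bits) starting at the read
     * pointer offset by uop_idx x micro-op mask size (4, 2, or 1 entries
     * for 8-, 16-, or 32-bit elements); the functional unit uses only the
     * low mask bits it needs and ignores the rest. */
    static void mq_read(const mask_queue_t *q, unsigned uop_idx,
                        unsigned uop_mask_size, unsigned short out[4])
    {
        unsigned base = (q->rdptr + uop_idx * uop_mask_size) % MQ_ENTRIES;
        for (unsigned i = 0; i < 4; i++)
            out[i] = q->entry[(base + i) % MQ_ENTRIES];
    }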
As an example, the first vector instruction is issued and allocated to the first queue entry 190(0) of the execution queue 19, while the mask data corresponding to the first vector instruction would be allocated to a first plurality of mask entries 210(0)-210(7) of the mask queue 21 based on the write pointer 211. Instead of allocating the 512-bit wide mask data from the mask register v(0) in the first queue entry 190(0) as part of the first vector instruction in the queue, the mask data is allocated to the mask queue 21 based on the position of the write pointer 211. The mask data may be sent from the mask register v(0) to the mask queue 21 directly, through a hard-wired bus, or through the decode/issue unit 13 as part of the issuing of the first vector instruction; the disclosure is not intended to limit the transmission path of the mask data. In the embodiments, the first vector instruction would have 128-bit wide mask data (4×32) due to the 16-bit wide data elements and 4 micro-ops, which requires 8 mask entries. After the allocation of the first vector instruction, the write pointer 211 is incremented by 8 to indicate the next mask entry for the next vector instruction. In the embodiments, the write enable bits for the first 8 mask entries are set starting from the write pointer 211 to allow the 128-bit mask data to be written to the mask queue 21.
With reference to
With reference to
In the case of dispatching the second vector instruction to the functional unit 20, the micro-op mask size is 1 due to the 32-bit wide data elements. The first micro-op of the second vector instruction would be dispatched to the functional unit 20 with the mask data stored in the mask entries 210(8)-210(11). Since the vector data has 32-bit wide data elements, the functional unit 20 can only use the first 16-bit mask data and ignore the other 48-bit mask data. The second micro-op of the second vector instruction would be dispatched to the functional unit 20 with the mask data stored in the mask entries 210(9)-210(12).
In some embodiments, a double-width vector instruction may be issued to the execution queue 19. In the operation of the double-width vector instruction, a result data of the vector operation would be twice the width of the source data. In detail, the first half of the source data (i.e., half the register width) is used to produce a first result data having the full register width, and the second half of the source data is used to produce a second result data having the full register width. The source registers are read twice when each micro-op of the double-width vector instruction is executed. In the embodiments, the mask data is for the result data width and not the source data width. As an example, the element data width is 16 bits and the result data width is 32 bits for the double-width instruction. For example, with LMUL=4, the "single-width" vector instruction of 16-bit elements would have 4 micro-op instructions and write back to 4 vector registers of the register file 14, and each micro-op instruction has 32-bit mask data. The "double-width" vector instruction of 16-bit elements would have 8 micro-op instructions and write back to 8 vector registers of the register file 14, where each micro-op instruction has 16-bit mask data. Referring back to
In some other embodiments, if a second vector instruction uses the same mask vector register v(0), the same LMUL, and the same ELEN, then the second vector instruction can use the same set of mask entries in the mask queue 21 as a first vector instruction. The embodiments do not intend to exclude other sizes of LMUL and ELEN, as long as the mask bits can be derived from the same mask entries based on v(0). There is no need to write the same mask data into the mask queue 21. The same mask vector register v(0) means that the vector register v(0) is not written by another instruction in between the first and the second vector instructions. A scoreboard bit can be used to indicate the status of the mask vector register v(0). The LMUL and ELEN values are stored with the read pointer 213 in order to validate the same LMUL and ELEN of the next vector instruction. The read pointer 213 is used as the identifier for the set of mask entries for a vector instruction. The mask queue 21 may include a vector instruction counter (not shown) to keep track of the number of vector instructions using the same set of mask entries, so that the read pointer 213 would only be relocated when the vector instruction counter reaches 0. When a vector instruction that uses the same set of mask entries is issued to the execution queue, the vector instruction counter is incremented by 1. When all micro-ops of a vector instruction are dispatched from the execution queue 19 to the functional unit 20, the vector instruction counter in the mask queue entry is decremented. When the vector instruction counter is zero, the read pointer 213 is incremented by the number of micro-ops, where each micro-op has the micro-op mask size. The above-described reuse of the mask entries in the mask queue 21 provides more efficient usage of mask storage and power since the same mask data are not written multiple times into the mask queue 21.
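A minimal C sketch of this reuse check (the mask_set_t type and its field names are hypothetical; v0_version stands in for the scoreboard bit tracking writes to v(0)):

    typedef struct {
        unsigned rdptr;      /* read pointer 213: identifies the entry set */
        unsigned lmul, elen; /* stored with rdptr for validation */
        unsigned v0_version; /* bumped whenever v(0) is written */
        unsigned refcnt;     /* vector instruction counter for the set */
    } mask_set_t;

    /* Reuse the current set of mask entries if v(0) has not been rewritten
     * and LMUL/ELEN match; otherwise a fresh mask-queue write is needed. */
    static int mq_try_reuse(mask_set_t *s, unsigned lmul, unsigned elen,
                            unsigned v0_version)
    {
        if (s->refcnt > 0 && s->lmul == lmul && s->elen == elen &&
            s->v0_version == v0_version) {
            s->refcnt++; /* one more vector instruction shares the set */
            return 1;    /* no new write into the mask queue */
        }
        return 0;
    }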
In the above, the mask data is shared by multiple vector instructions in the same execution queue. However, the sharing of mask data is not limited to vector instructions in the same queue. In some other embodiments, the mask data of one mask queue may be shared by multiple vector instructions in different execution queues. For example, a first vector instruction may be issued to the execution queue 19B with a first mask data to the mask queue 21B. If a second vector instruction is issued to the execution queue 19C and uses the same mask data (i.e., the same mask vector register v(0), the same LMUL, and the same ELEN) as the first vector instruction in the execution queue 19B, the second vector instruction may also share the mask data in the mask queue 21B. The vector instruction counter as described above may also be used to count down the first and second vector instructions. In yet some other embodiments, one mask queue may be shared between the execution queues even if the first and second vector instructions do not use the same mask data.
In accordance with the above embodiments, the mask data may be handled by the mask queue presented above, instead of reserving 512 bits (or any width of the mask register) in every entry of the execution queue. In the disclosure, mask data of multiple vector instructions may be stored in the mask queue. The corresponding mask data may be accessed from the mask queue when the vector instruction(s) is dispatched from the execution queue to the functional unit for execution. The issuing of the vector instruction from the decode/issue unit to the execution queue may be stalled when the mask queue does not have enough available entries to write the mask data.
In the embodiments, one mask queue may be dedicated to one execution queue. In some other embodiments, the mask queue 21 may be shared by more than one execution queue. The same vector mask with the same LMUL and ELEN can be used for multiple vector instructions in different functional units, in which case sharing of the mask queue between multiple execution queues may save more area. The embodiments do not limit the sharing of a mask queue by multiple execution queues to the case of sharing of the mask queue entries by vector instructions in different execution queues. Rather, the mask queue can be shared even if vector instructions in different execution queues do not share any mask queue entries. The mask queue entries are marked with LMUL and ELEN; if the second vector instruction uses the same mask vector register, LMUL, and ELEN, then the set of mask queue entries (e.g., 210(0)-210(7) of
In accordance with one of the embodiments, a microprocessor includes a decode/issue unit and an execution queue. The execution queue includes a plurality of queue entries and a mask queue. In the embodiments, the execution queue is configured to allocate a first instruction issued from the decode/issue unit and operating on data having a plurality of first data elements to a first queue entry. The mask queue includes a plurality of mask entries, and a first mask data corresponding to the first instruction is written to a first number of mask entries when the first instruction is allocated to the first queue entry in the execution queue, wherein the first number is determined based on a width of the first data element.
In accordance with one of the embodiments, a method of handling mask data of vector instructions includes at least the following steps: a step of issuing a first instruction operating on data having a plurality of first data elements to an execution queue which includes a mask queue, a step of allocating the first instruction to a first queue entry in the execution queue, and a step of writing a first mask data corresponding to the first instruction to a first number of mask entries in the mask queue, wherein the first number is determined based on a width of the first data element.
The foregoing has outlined features of several embodiments so that those skilled in the art may better understand the detailed description that follows. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions and alterations herein without departing from the spirit and scope of the present disclosure.