Stream Processor with Variable Single Instruction Multiple Data (SIMD) Factor and Common Special Function

Abstract
Included are embodiments of a stream processor configured to process data in any of a plurality of different formats. At least one embodiment of the stream processor includes a first scalar arithmetic logic unit (ALU) configured to process a plurality of sets of short data in response to a received short format control signal from an instruction set and to process a set of long data in response to a received long format control signal from the instruction set. Embodiments of the processor also include a second ALU configured to receive the processed data from the first ALU and to process input data and the processed data according to a control signal from the instruction set. Still other embodiments include a special function unit (SFU) configured to provide additional computational functionality to the first ALU and the second ALU.
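As an illustration only, the dual-format behavior summarized in the abstract can be sketched in software. The following minimal Python sketch is not the disclosed hardware. It assumes a 32-bit long format and two 16-bit short lanes per word, and it uses integer addition as a stand-in for the ALU operation. The names `Format`, `alu_add`, and `paired_alus` are hypothetical.

```python
from enum import Enum

MASK16 = 0xFFFF        # assumed width of one short-format lane
MASK32 = 0xFFFFFFFF    # assumed width of one long-format word


class Format(Enum):
    """Hypothetical format control signal taken from the instruction."""
    SHORT = "short"  # several narrow lanes per word (assumed: 2 x 16 bits)
    LONG = "long"    # one wide lane per word (assumed: 32 bits)


def alu_add(a: int, b: int, fmt: Format) -> int:
    """Add two 32-bit words as one long value or as two short lanes."""
    if fmt is Format.LONG:
        return (a + b) & MASK32
    # SHORT: add each 16-bit lane independently; carries do not cross lanes.
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


def paired_alus(x: int, y: int, z: int, fmt: Format) -> int:
    """First ALU computes x + y; second ALU combines that result with z."""
    first = alu_add(x, y, fmt)       # processed data from the first ALU
    return alu_add(first, z, fmt)    # second ALU: its own input plus that data
```

Under these assumptions, `alu_add(0x0001FFFF, 0x00010001, Format.SHORT)` yields `0x00020000` because the carry out of the low lane is discarded, whereas the same operands in long format yield `0x00030000`.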
Description

BRIEF DESCRIPTION

Many aspects of the disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views. While several embodiments are described in connection with these drawings, there is no intent to limit the disclosure to the embodiment or embodiments disclosed herein. On the contrary, the intent is to cover all alternatives, modifications, and equivalents.



FIG. 1A is a flowchart illustrating stream data processing steps that can be taken in an exemplary vector processing unit.



FIG. 1B is a flowchart illustrating stream data processing steps that can be taken in an exemplary scalar processing unit, similar to the steps illustrated in FIG. 1A.



FIG. 1C is an exemplary stream processing SIMD structure with software implementation of complex mathematical functions.



FIG. 1D is an exemplary stream processing SIMD structure with hardware implementation of complex mathematical functions using a private special function unit (SFU) for each ALU.



FIG. 1E is an exemplary stream processing SIMD structure with hardware implementation of complex mathematical functions using a common SFU for all ALUs.



FIG. 1F is an exemplary stream processing SIMD structure with implementation of complex mathematical functions using a common SFU with interleaved access.



FIG. 1G is an exemplary illustration of an SIMD factor reduction in the case of a common SIMD structure for both vertex and triangle processing.



FIG. 2A is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 4.



FIG. 2B is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 1.



FIG. 2C is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 8 for short data format.



FIG. 2D is a flowchart illustrating steps that can be taken in an exemplary processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 4 for short data format.



FIG. 3 is an exemplary logical structure of paired scalar ALUs with dual format processing capabilities, illustrating stream ALU functionality and the processing characteristics from FIGS. 1 and 2A-2G.



FIG. 4 is an exemplary stream processing unit in long format processing mode with paired scalar ALUs, similar to the structure from FIG. 3, and showing an upper level of control and memory.



FIG. 5A is a table illustrating exemplary arithmetic functionality of paired scalar ALUs, and can be used as a base for numerical processing instruction set development such as the ALUs illustrated in FIGS. 3 and 4.



FIG. 5B is a GPU structure where an exemplary stream processor pool is used as a computational core, where the stream processor has a scalable architecture and may contain from 2 to 16 ALUs combined with a reduced number of special function units.



FIG. 6 is an exemplary flow diagram and logical structure of a stream processor with 4 scalar ALUs, and SFU interaction, similar to the ALUs from FIGS. 3 and 4.



FIG. 7A is a flowchart illustrating an exemplary normalized vector difference processing in a vector ALU.



FIG. 7B is a flowchart of an exemplary processing routine in a proposed stream scalar ALU combined with an SFU.



FIG. 7C is a continuation of FIG. 7B.



FIG. 8 is an exemplary ALU module, implementing functionality of the ALUs from FIG. 6.



FIG. 9 is an exemplary modular stream processor with a combination of 4 ALU modules, similar to the ALUs from FIGS. 3 and 4.



FIGS. 10A-10C are diagrams illustrating exemplary logical structure and data formats for Multiply Accumulate units, such as the Multiply Accumulate Unit from FIG. 8.



FIG. 11 is an exemplary structure of a MACC unit, similar to the MACC unit from FIG. 8.



FIG. 12 is an exemplary diagram of a short exponent calculation, similar to the short exponent calculation from FIG. 11.



FIG. 13 is an exemplary diagram of a short exponent calculation combined with a mixed exponent, similar to the short exponent calculation from FIG. 11.



FIG. 14 is an exemplary diagram of a short mantissa path for various channels, describing details of the mantissa path illustrated in FIG. 11.



FIG. 15 is an exemplary diagram of a long exponent calculation, describing details of the exponent calculation block from FIG. 11.



FIG. 16 is an exemplary diagram of a long exponent calculation, for a paired ALU, describing details of the long exponent calculation block from FIG.



FIG. 17 is an exemplary diagram of a long mantissa data path, describing details of a data path illustrated in FIG. 11.



FIG. 18 is an exemplary diagram of a long mantissa data path for a paired ALU, similar to the data path illustrated in FIG. 11.



FIG. 19 is an exemplary diagram of a mixed exponent calculation, describing details of the mixed exponent calculation illustrated in FIG. 11.



FIG. 20 is an exemplary diagram of a mixed exponent calculation for a paired ALU, similar to a mixed exponent calculation illustrated in FIG. 19.



FIG. 21 is an exemplary diagram of a mixed mantissa data path, describing details of the data path illustrated in FIG. 11.



FIG. 22 is an exemplary diagram of a mixed mantissa data path for a paired ALU, similar to a data path illustrated in FIG. 21.



FIG. 23 is an exemplary diagram of a merged mantissa data path, which can process short and long data formats, describing details of a possible implementation of the data path illustrated in FIG. 11.



FIG. 24 is an exemplary diagram illustrating a merged mantissa data path, similar to a data path illustrated in FIG. 11.



FIG. 25A is an exemplary diagram illustrating merged shift and control logic, which can be applied in the MACC from FIGS. 23 and 24.



FIG. 25B is an exemplary diagram illustrating sign control logic, which can be applied in the MACC from FIGS. 23 and 24.



FIG. 26 is an exemplary table of complement shift input and output formats, which may be utilized in the MACC from FIG. 11.



FIG. 27A is an exemplary diagram of a mantissa addition path, which can be utilized in the MACC from FIGS. 23 and 24.



FIG. 27B is an exemplary diagram of processing formats that can be utilized in the MAD carry save adder tree units from FIGS. 23 and 24.



FIG. 27C is a continuation of the processing formats from FIG. 27B.



FIG. 28A is an exemplary diagram of a fence implementation in a CSA adder, which may be utilized in the MACC from FIGS. 23 and 24.



FIG. 28B is an exemplary diagram of a fence implementation in a CPA adder, which may be utilized in the MACC from FIGS. 23 and 24.



FIG. 29 is an exemplary diagram of a fence implementation in a complement shift unit, which may be utilized in the MACC from FIGS. 23 and 24.



FIG. 30A is an exemplary fence in a normalization shifter, which may be utilized in the MACC from FIGS. 23 and 24.



FIG. 30B is a more detailed view of the exemplary fence from FIG. 30A.



FIG. 31 is a flowchart illustrating an exemplary process that may be utilized for sending data to a functionally separated ALU.


Claims
  • 1. A stream processor configured to process data in any of a plurality of different formats, the stream processor comprising: a first arithmetic logic unit (ALU), configured to: process a first plurality of sets of short format data in response to a received short format control signal from an instruction set; and process a first set of long format data in response to a received long format control signal from the instruction set; and a second arithmetic logic unit (ALU), configured to: process a second plurality of sets of short format data in response to a received short format control signal from the instruction set; process a second set of long format data in response to a received long format control signal from the instruction set; and receive the processed data from the first arithmetic logic unit (ALU); and process input data and the processed data from the first ALU according to a control signal from the instruction set.
  • 2. The processor of claim 1, further comprising a special function unit (SFU) configured to provide additional computational functionality to the first ALU and the second ALU.
  • 3. The processor of claim 1, wherein the first ALU is a scalar ALU.
  • 4. The processor of claim 1, wherein the second ALU is a scalar ALU.
  • 5. The processor of claim 1, wherein, in response to receiving short format data, the stream processor is configured to functionally divide at least one pair of the ALUs to facilitate dual format processing with a variable Single Instruction Multiple Data (SIMD) factor for short formats and for long formats.
  • 6. The processor of claim 1, wherein the instruction set includes an instruction for processing variable format data in a plurality of different modes.
  • 7. The processor of claim 1, wherein the instruction set includes at least one of the following: a normal type instruction, a blend type instruction, and a cross type instruction applicable for short format data processing and for long format data processing.
  • 8. The processor of claim 1, wherein the instruction set includes at least one instruction to process in at least one of the following modes: a short format operand mode, a long format operand mode, and a mixed format operand mode.
  • 9. The processor of claim 1, wherein the instruction set is configured to control variable SIMD folding mode, when output data of the first ALU is sent as an operand to the second ALU in long format mode; and wherein the output of one channel of the first ALU is sent as an operand to the second channel of the first ALU in a short format mode.
  • 10. The processor of claim 1, wherein the special function unit is coupled to the first ALU and the second ALU.
  • 11. A method for processing data in any of a plurality of different formats, the method comprising: determining that received data is short format data; in response to determining that the received data is short format data, functionally separating a first arithmetic logic unit (ALU) to a plurality of channels for processing, according to an instruction set; functionally separating a second ALU to a plurality of channels for processing, according to the instruction set; processing data in the first ALU; and sending the processed data to the second functionally separated ALU with a plurality of channels for short data.
  • 12. The method of claim 11, wherein the first ALU is configured to process short format data and long format data.
  • 13. The method of claim 11, wherein the second ALU is configured to process short format data and long format data.
  • 14. The method of claim 11, wherein the first ALU is configured to operate as a scalar ALU.
  • 15. The method of claim 11, wherein the second ALU is configured to operate as a scalar ALU with at least one of the following: a plurality of channels for short format data and a channel for long format data.
  • 16. The method of claim 11, further comprising processing data at a special function unit, wherein the special function unit is configured to receive data from the first ALU and the second ALU.
  • 17. The method of claim 11, wherein the instruction set includes an instruction for processing variable format data in a plurality of different modes.
  • 18. The method of claim 11, wherein the instruction set includes at least one of the following: a normal type instruction, a blend type instruction, and a cross type instruction.
  • 19. A modular stream processor configured to process data in a plurality of different formats, the modular stream processor comprising: a first Arithmetic Logic Unit (ALU) configured to receive first input data and control data, the control data being configured to indicate a format associated with the received input data, the first ALU further configured to process short format input data and long format input data, according to the control data; a second ALU configured to receive the control data from the first ALU, the second ALU further configured to process second input data, the second input data being related to the first input data, the second ALU being further configured to process short format input data and long format input data, according to the control data, a third ALU configured to receive the control data from the second ALU, the third ALU further configured to receive third input data, the third input data being related to the first input data and the second input data, the third ALU further configured to process short format input data and long format input data according to the control data; and a fourth ALU configured to receive the control data from the third ALU, the fourth ALU further configured to receive fourth input data, the fourth input data being related to the first input data, the second input data, and the third input data, the fourth ALU further configured to process short format data and long format data, according to the control data.
  • 20. The modular stream processor of claim 19, wherein the first ALU, the second ALU and the third ALU are configured to receive operation data from a Special Function Unit (SFU), the operation data being configured to indicate an operation to perform on the received input data.
  • 21. The modular stream processor of claim 19, wherein the first ALU is further configured to receive common data, the first ALU being further configured to send the common data to the second ALU, the second ALU being further configured to send the received common data to the third ALU, the third ALU being further configured to send the received common data to the fourth ALU.
  • 22. The modular stream processor of claim 19, wherein at least one of the following is configured to process short format data and long format data: the first ALU, the second ALU, the third ALU, and the fourth ALU.
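For illustration only, the control-data forwarding recited in claim 19 can be modeled as a chain of four modules, each processing its own input according to control data received from the preceding module. The Python sketch below is a simplified software model under stated assumptions, not the claimed hardware: the `Control` record, the lane count of two, the doubling operation used as a placeholder, and the names `alu_module` and `run_chain` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Control:
    """Hypothetical control data indicating the format of the input data."""
    short_format: bool  # True: input holds several short lanes
    lanes: int = 2      # assumed number of short lanes per module


def alu_module(control: Control, data: List[int]) -> Tuple[Control, List[int]]:
    """Process this module's input per the control data, then forward the control."""
    expected = control.lanes if control.short_format else 1
    if len(data) != expected:
        raise ValueError(f"expected {expected} value(s), got {len(data)}")
    # Placeholder operation (assumption): double every lane.
    result = [2 * value for value in data]
    return control, result


def run_chain(control: Control, inputs: List[List[int]]) -> List[List[int]]:
    """Feed control data through the first to fourth modules in order."""
    outputs = []
    for data in inputs:
        # Each module receives the control data from the module before it.
        control, out = alu_module(control, data)
        outputs.append(out)
    return outputs
```

In this model, `run_chain(Control(short_format=True), [[1, 2], [3, 4], [5, 6], [7, 8]])` processes two short lanes in each of the four modules, while `run_chain(Control(short_format=False), [[1], [2], [3], [4]])` processes one long value per module.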
Provisional Applications (1)
Number Date Country
60765571 Feb 2006 US