Dual Mode Floating Point Multiply Accumulate Unit

Abstract
Included are embodiments of a Multiply-Accumulate Unit to process multiple format floating point operands. For short format operands, embodiments of the Multiply-Accumulate Unit are configured to process data with twice the throughput of long and mixed format data. At least one embodiment can include a short exponent calculation component configured to receive short format data, a long exponent calculation component configured to receive long format data, and a mixed exponent calculation component configured to receive short exponent data, the mixed exponent calculation component further configured to receive long format data. Embodiments also include a mantissa datapath configured to accommodate processing of long, mixed, and short floating point operands.
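As an illustration of how one multiplier structure could serve both formats, the following Python sketch models a sectional multiplier arithmetically. It is not taken from the disclosure: the 12-bit section width, the function names, and the choice of which sections carry the two short products are assumptions made for the example.

```python
# Hypothetical widths: the source does not state mantissa sizes.
SHORT_BITS = 12
LONG_BITS = 2 * SHORT_BITS
MASK = (1 << SHORT_BITS) - 1


def split(x):
    """Split a long-width value into (high, low) short-width sections."""
    return (x >> SHORT_BITS) & MASK, x & MASK


def sectional_multiply(a, b, long_mode):
    """Model a multiplier built from short-width sections.

    long_mode=True : a and b are single long mantissas; returns a
                     1-tuple holding the full product.
    long_mode=False: a and b each pack two short mantissas (high, low);
                     returns the two independent products.
    """
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)

    # The "diagonal" sections are needed in both modes.
    hh = a_hi * b_hi
    ll = a_lo * b_lo
    if not long_mode:
        # Short mode: cross terms are not used, so the two channels
        # never mix and two results come out per pass.
        return (hh, ll)

    # Long mode: cross sections are added in at their bit weights.
    hl = a_hi * b_lo
    lh = a_lo * b_hi
    return ((hh << LONG_BITS) + ((hl + lh) << SHORT_BITS) + ll,)
```

In short mode the same pass yields two results instead of one, which is the arithmetic basis for the doubled throughput stated above.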
Description

BRIEF DESCRIPTION

Many aspects of the disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present disclosure. Moreover, in the drawings, like reference numerals designate corresponding parts throughout the several views. While several embodiments are described in connection with these drawings, there is no intent to limit the disclosure to the embodiment or embodiments disclosed herein. On the contrary, the intent is to cover all alternatives, modifications, and equivalents.



FIG. 1A is a flowchart illustrating stream data processing steps that can be taken in an exemplary vector processing unit.



FIG. 1B is a flowchart illustrating stream data processing steps that can be taken in an exemplary scalar processing unit, similar to the steps illustrated in FIG. 1A.



FIG. 1C is an exemplary stream processing SIMD structure with software implementation of complex mathematical functions.



FIG. 1D is an exemplary stream processing SIMD structure with hardware implementation of complex mathematical functions using private special function unit (SFU) for each ALU.



FIG. 1E is an exemplary stream processing SIMD structure with hardware implementation of complex mathematical functions using a common SFU for all ALUs.



FIG. 1F is an exemplary stream processing SIMD structure with implementation of complex mathematical functions using a common SFU with interleaved access to common SFU.



FIG. 1G is an exemplary illustration of an SIMD factor reduction in the case of a common SIMD structure for both vertex and triangle processing.



FIG. 2A is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 4.



FIG. 2B is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 1.



FIG. 2C is a flowchart illustrating steps that can be taken in an exemplary scalar processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 8 for short data format.



FIG. 2D is a flowchart illustrating steps that can be taken in an exemplary processing unit, similar to the flowchart from FIG. 1, with an SIMD factor 4 for short data format.



FIG. 3 is an exemplary logical structure of paired scalar ALUs with dual format processing capabilities, illustrating processing characteristics from FIGS. 1 and 2A-2G, illustrating stream ALU functionality.



FIG. 4 is an exemplary stream processing unit in long format processing mode with paired scalar ALUs, similar to the structure from FIG. 3, and showing an upper level of control and memory.



FIG. 5A is a table illustrating exemplary arithmetic functionality of paired scalar ALUs, and can be used as a base for numerical processing instruction set development such as the ALUs illustrated in FIGS. 3 and 4.



FIG. 5B is a GPU structure where an exemplary stream processor pool is used as a computational core, where the stream processor has a scalable architecture and may contain from 2 to 16 ALUs combined with a reduced number of special function units.



FIG. 6 is an exemplary flow diagram and logical structure of a stream processor with 4 scalar ALUs and SFU interaction, similar to the ALUs from FIGS. 3 and 4.



FIG. 7A is a flowchart illustrating an exemplary normalized vector difference processing in a vector ALU.



FIG. 7B is a flowchart of an exemplary processing routine in a proposed stream scalar ALU combined with an SFU.



FIG. 7C is a continuation of FIG. 7B.



FIG. 8 is an exemplary ALU module, implementing functionality of the ALUs from FIG. 6.



FIG. 9 is an exemplary modular stream processor with a combination of 4 ALU modules, similar to the ALUs from FIGS. 3 and 4.



FIGS. 10A-10C are diagrams illustrating exemplary logical structure and data formats for Multiply Accumulate units, such as the Multiply Accumulate Unit from FIG. 8.



FIG. 11 is an exemplary structure of a MACC unit, similar to the MACC unit from FIG. 8.



FIG. 12 is an exemplary diagram of a short exponent calculation, similar to the short exponent calculation from FIG. 11.



FIG. 13 is an exemplary diagram of a short exponent calculation combined with a mixed exponent, similar to the short exponent calculation from FIG. 11.



FIG. 14 is an exemplary diagram of a short mantissa path for various channels, describing details of the mantissa path illustrated in FIG. 11.



FIG. 15 is an exemplary diagram of a long exponent calculation, describing details of the exponent calculation block from FIG. 11.



FIG. 16 is an exemplary diagram of a long exponent calculation, for a paired ALU, describing details of the long exponent calculation block from FIG.



FIG. 17 is an exemplary diagram of a long mantissa data path, describing details of a data path illustrated in FIG. 11.



FIG. 18 is an exemplary diagram of a long mantissa data path for a paired ALU, similar to the data path illustrated in FIG. 11.



FIG. 19 is an exemplary diagram of a mixed exponent calculation, describing details of the mixed exponent calculation illustrated in FIG. 11.



FIG. 20 is an exemplary diagram of a mixed exponent calculation for a paired ALU, similar to a mixed exponent calculation illustrated in FIG. 19.



FIG. 21 is an exemplary diagram of a mixed mantissa data path, describing details of the data path illustrated in FIG. 11.



FIG. 22 is an exemplary diagram of a mixed mantissa data path for a paired ALU, similar to a data path illustrated in FIG. 21.



FIG. 23 is an exemplary diagram of a merged mantissa data path, which can process short and long data formats, describing details of a possible implementation of the data path illustrated in FIG. 11.



FIG. 24 is an exemplary diagram illustrating a merged mantissa data path, similar to a data path illustrated in FIG. 11.



FIG. 25A is an exemplary diagram illustrating merged shift and control logic, which can be applied in the MACC from FIGS. 23 and 24.



FIG. 25B is an exemplary diagram illustrating sign control logic, which can be applied in the MACC from FIGS. 23 and 24.



FIG. 26 is an exemplary table of complement shift input and output formats, which may be utilized in the MACC from FIG. 11.



FIG. 27A is an exemplary diagram of a mantissa addition path, which can be utilized in the MACC from FIGS. 23 and 24.



FIG. 27B is an exemplary diagram of processing formats that can be utilized in the MAD carry save adder tree units from FIGS. 23 and 24.



FIG. 27C is a continuation of the processing formats from FIG. 27B.



FIG. 28A is an exemplary diagram of a fence implementation in a CSA adder, which may be utilized in the MACC from FIGS. 23 and 24.



FIG. 28B is an exemplary diagram of a fence implementation in a CPA adder, which may be utilized in the MACC from FIGS. 23 and 24.



FIG. 29 is an exemplary diagram of a fence implementation in a complement shift unit, which may be utilized in the MACC from FIGS. 23 and 24.



FIG. 30A is an exemplary fence in a normalization shifter, which may be utilized in the MACC from FIGS. 23 and 24.



FIG. 30B is a more detailed view of the exemplary fence from FIG. 30A.



FIG. 31 is a flowchart illustrating an exemplary process that may be utilized for sending data to a functionally separated ALU.


Claims
  • 1. A Multiply-Accumulate Unit, configured to process a plurality of different data types, the Multiply-Accumulate Unit comprising: a short format exponent datapath configured to facilitate processing of a first set of short format data; a long format exponent datapath configured to facilitate processing of long format data; a mixed format exponent datapath configured to facilitate processing of a second set of short format data and long format data; and a mantissa datapath situated to facilitate processing of a plurality of different formatted operands, wherein a plurality of sets of short format data and a set of long format data are processed utilizing a common hardware structure.
  • 2. The Multiply-Accumulate Unit of claim 1, wherein the mantissa datapath further comprises a sectional multiplier with a plurality of re-configurable outputs, the outputs being configured to process at least one of the following: a plurality of sets of short mantissa data and a set of long mantissa data.
  • 3. The Multiply-Accumulate Unit of claim 1, wherein the mantissa datapath further comprises sectional complement logic and an alignment shifter unit, the alignment shifter unit configured to receive data from an exponent datapath, the alignment shifter unit further configured to receive data from sectional multipliers and input operands.
  • 4. The Multiply-Accumulate Unit of claim 3, wherein the alignment shifter unit is configured to receive at least one of the following: a plurality of sets of short exponent data, a set of long exponent data, a plurality of sets of mixed exponent data, a plurality of sets of short mantissa data, a set of long mantissa data, and a plurality of mixed mantissa data.
  • 5. The Multiply-Accumulate Unit of claim 1, wherein the mantissa datapath further comprises: a first step Multiply and Add Carry Save Adder unit configured to receive data in at least one of a plurality of different data formats and further configured to process the received data and output the processed data to a second step Multiply and Add unit; and a second step Multiply and Add (MAD) unit configured to receive data from a half MAD CSA tree, the half MAD CSA tree configured to add partial results from a plurality of sectional multipliers with configurable outputs.
  • 6. The Multiply-Accumulate Unit of claim 1, further comprising at least one of the following for facilitating processing short format data and long format data: a sectional multiplier with re-configurable outputs, sectional complement logic, an alignment shifter unit, a two-step Carry Save Adder (CSA) with fence implementation, a Carry Propagate Adder (CPA) with fence implementation, and normalizer with fenced exponent adder and fenced mantissa shifter.
  • 7. The Multiply-Accumulate Unit of claim 1, further comprising: a sectional multiplier configured to operate with short and long data formats; a Multiply Accumulate (MAC) adder configured to operate as a Carry Save Adder tree; and a full adder and normalization unit configured to convert data from a Carry Save Adder (CSA) redundant format to a normal format.
  • 8. The Multiply-Accumulate Unit of claim 1, further comprising a merged mantissa channel configured to process short format data and long format data.
  • 9. The Multiply-Accumulate Unit of claim 1, further comprising a Multiply-Accumulate Carry Save Adder tree unit configured to receive data in any of a plurality of different data formats, the Multiply-Accumulate Carry Save Adder tree unit further configured to process the received data and output the processed data to a normalization unit.
  • 10. A Multiply-Accumulate Unit configured to process a plurality of different data types, the Multiply-Accumulate Unit comprising: a short format exponent data path, the short format exponent data path including a first channel and a second channel, the short format exponent data path also including logic for processing short format exponent data; a merged mantissa data path, the merged mantissa data path including a first channel and a second channel, the merged mantissa data path also including logic for processing short format mantissa data with long format mantissa data; and a sectional multiplier with re-configurable outputs capable of processing at least one of the following: a plurality of sets of short format data and a set of long format data, utilizing a common hardware structure.
  • 11. The Multiply-Accumulate Unit of claim 10, further comprising a long exponent data path, the long exponent data path including a first channel and a second channel, the long exponent data path also including logic for processing long format exponent data.
  • 12. The Multiply-Accumulate Unit of claim 10, further comprising a mixed mantissa data path, the mixed mantissa data path including a first channel and a second channel, the mixed mantissa data path including logic for processing long format mantissa data with short format mantissa data.
  • 13. The Multiply-Accumulate Unit of claim 10, further comprising a fence for facilitating processing of short format data.
  • 14. The Multiply-Accumulate Unit of claim 10, wherein the short format exponent data includes one-half the number of bits of the long format exponent data.
  • 15. The Multiply-Accumulate Unit of claim 10, wherein the short format mantissa data includes at least one-half the number of bits of the long format mantissa data.
  • 16. A method of processing a plurality of different data types, the method comprising: receiving data at a merged mantissa datapath; determining whether the received data includes short format data; determining whether the received data includes long format data; in response to determining that the received data includes short format data, processing the short format data according to a control signal; in response to determining that the received data includes long format data, processing the long format data according to a control signal; and sending the processed data to output.
  • 17. The method of claim 16, wherein processing includes sending data to a sectional multiplier.
  • 18. The method of claim 16, wherein processing includes: sending data to a Multiply and Add Carry Save Adder unit; and sending data to a complement and alignment shifter unit.
  • 19. The method of claim 16, wherein processing includes sending data to a Multiply-Accumulate Carry Save Adder unit.
  • 20. The method of claim 16, wherein processing includes sending data to a full adder and normalization unit.
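For illustration only, the following Python sketch shows one plausible reading of the "fence" recited in claims 6 and 13, combined with the control-signal selection of claim 16. The source does not define the fence in this section, so treating it as a gate that blocks carries between two short channels is an assumption, as are the 16-bit channel width and all names used here.

```python
# Hypothetical channel width: the source gives no bit counts here.
HALF_BITS = 16
HALF_MASK = (1 << HALF_BITS) - 1
FULL_MASK = (1 << (2 * HALF_BITS)) - 1


def fenced_add(x, y, short_mode):
    """Add two words on a shared adder model.

    short_mode acts as the control signal:
      False: one long addition across the full width.
      True : the fence is active, so the low and high halves are added
             independently and no carry crosses the boundary.
    Results wrap within their own width, as a fixed-width adder would.
    """
    if not short_mode:
        return (x + y) & FULL_MASK

    # Fence active: each channel wraps on its own.
    lo = ((x & HALF_MASK) + (y & HALF_MASK)) & HALF_MASK
    hi = (((x >> HALF_BITS) & HALF_MASK) + ((y >> HALF_BITS) & HALF_MASK)) & HALF_MASK
    return (hi << HALF_BITS) | lo


# Same operands, two interpretations selected by the control signal.
long_result = fenced_add(0x0001FFFF, 0x00000001, short_mode=False)
short_result = fenced_add(0x0001FFFF, 0x00000001, short_mode=True)
```

With the fence inactive the carry out of the low half ripples into the high half; with it active the low channel wraps to zero and the high channel is unaffected, so two short operands can share the adder without interfering.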
Provisional Applications (1)
Number Date Country
60765571 Feb 2006 US