The present invention relates to floating-point processing, and more particularly to a system and method for processing single-precision floating-point numbers.
Single-instruction multiple-data (SIMD) processors are well known. They are typically used to support both single-precision (SP) and double-precision (DP) floating-point multiplication operations to satisfy the requirements of many graphics applications. SIMD processors enable one instruction to perform the same operation on multiple data items. As such, what would typically require a repeated succession of instructions (i.e. a loop) can be performed in one instruction.
A problem with conventional SIMD processors is that they occupy a significant amount of physical space. Conventional SIMD processors have separate SP and DP data paths for executing SIMD instructions. Also, they consume a tremendous amount of power due to the additional hardware required for the data paths. These problems are worsened when SIMD processors are designed to process a large amount of data.
Accordingly, what is needed is an improved system and method for processing both SP and DP floating-point numbers. The system and method should be simple, cost effective, and capable of being easily adapted to existing technology. The present invention addresses such a need.
A system and method for processing single-precision floating-point numbers is disclosed. The system includes a processor that has a double-precision (DP) register, wherein the DP register receives a plurality of single-precision (SP) operands, and a recoder coupled to the DP register, wherein the recoder recodes a first SP operand of the plurality of SP operands. The processor also includes a plurality of partial product (PP) units coupled to the DP register, wherein each PP unit of the plurality of PP units processes a second SP operand of the plurality of SP operands.
According to the method and system disclosed herein, the present invention provides savings in core area, enhances performance by reducing routing problems of operands to DP and SP pipelines, and provides power savings since only one set of registers is clocked for both DP and SP operations.
The present invention relates to floating-point processing, and more particularly to a system and method for processing single-precision floating-point numbers. The following description is presented to enable one of ordinary skill in the art to make and use the invention, and is provided in the context of a patent application and its requirements. Various modifications to the preferred embodiment and the generic principles and features described herein will be readily apparent to those skilled in the art. Thus, the present invention is not intended to be limited to the embodiment shown, but is to be accorded the widest scope consistent with the principles and features described herein.
A processor for processing SP floating-point numbers is disclosed. The processor performs single-precision (SP) multiply operations using a double-precision (DP) design. The system includes a DP register that receives an SP multiplier and an SP multiplicand, a recoder that recodes the SP multiplier, and a plurality of partial product (PP) units that process the SP multiplicand. The processor also includes muxes corresponding with the PP units that generate PPs based on the recoded SP multiplier and the processed SP multiplicand. The processor also includes a Wallace-tree adder that sums the PPs. To more particularly describe the features of the present invention, refer now to the following description in conjunction with the accompanying figures.
Although the present invention is described in the context of 27 PP units 120 [00-26] and 27 booth muxes 130 [00-26], one of ordinary skill in the art will readily recognize that there could be any number of PP units and booth muxes, and their use would be within the spirit and scope of the present invention.
The DP register 102 is a 64-bit register, which can receive both DP and SP operands. In accordance with the present invention, the DP register 102 receives two SP multiplier-multiplicand operand pairs MRSP0 and MDSP0, and MRSP1 and MDSP1. Since a DP mantissa is typically 53 bits and an SP mantissa is typically 24 bits, the two SP mantissas are placed appropriately in a 53-bit DP format for booth recoding.
The booth recoder 110 is a DP booth recoder 110 that can receive both DP and SP operands. In accordance with the present invention, the booth recoder 110 receives both of the SP multipliers MRSP0 and MRSP1.
In accordance with the present invention, the PP units can receive both DP and SP operands. As such, each of the PP units 120 [00-26] receives both of the multiplicands MDSP0 and MDSP1. Each PP unit 120 [00-26] is associated with one booth mux 130 [00-26].
Next, in a step 204, the multipliers are recoded. Specifically, the 53-bit data for the multiplier of an SP operation is formed by concatenating the 24-bit multiplier MRSP0, a 4-bit multiplier shift (4'b0000), the 24-bit multiplier MRSP1, and a 1-bit multiplier shift (1'b0). Radix-4 modified booth-recoding is used to recode the multiplier formed by this concatenation. In SP mode, the booth recoding in
Next, in a step 206, the multiplicands are processed in the PP units 120 [00-26]. Specifically, two 24-bit SP multiplicands MDSP0 and MDSP1 are placed appropriately in the 53-bit DP format. The PP units 120 [00-26] generate PP vectors, each of which can be one of +2 MD, −2 MD, +1 MD, −1 MD, or 0 MD. These PP vectors are sent to the respective booth muxes 130 [00-26].
Special adjustment of the second SP multiplicand MDSP1 is done to align the binary points of the two SP PPs, which eases the design of leading zero anticipators (LZA) for the results of the SP operations. Also, additional logic is used to handle the sign extension of the DP/SP partial products and the elimination of bogus carries from the PP vectors.
Next, in a step 208, PPs based on the multiplier and multiplicand are generated at the booth muxes 130 [00-26]. Specifically, each booth mux 130 [00-26] receives PP vectors from its corresponding PP unit 120 [00-26] and receives selection data/bits generated from recoding the multipliers MRSP0 and MRSP1 from the booth recoder 110. The selection data selects the appropriate PP vector (e.g. +2 MD, −2 MD, +1 MD, −1 MD, or 0 MD). Based on the selection data, each booth mux outputs a PP that is based on the selected PP vector. Accordingly, 27 PPs are outputted since there are 27 booth muxes.
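The selection performed at each booth mux can be illustrated with a short behavioral sketch. It assumes the textbook radix-4 modified Booth encoding, in which a 3-bit group (hi, mid, lo) selects the digit lo + mid − 2·hi; the description does not spell out the encoding table, so this mapping and the function names `pp_vectors`, `booth_digit`, and `booth_mux` are assumptions made for illustration only, not the disclosed circuit.

```python
# Behavioral sketch only. All names are hypothetical; the encoding table is the
# textbook radix-4 modified Booth table, assumed rather than taken from the text.

def pp_vectors(md: int) -> dict:
    """Return the five candidate PP vectors for a multiplicand value md."""
    return {+2: 2 * md, -2: -2 * md, +1: md, -1: -md, 0: 0}

def booth_digit(hi: int, mid: int, lo: int) -> int:
    """Map a 3-bit selection group to a digit in {-2, -1, 0, +1, +2}."""
    return lo + mid - 2 * hi

def booth_mux(group: tuple, md: int) -> int:
    """Select the PP vector named by the 3-bit selection data."""
    return pp_vectors(md)[booth_digit(*group)]

# Walk through all eight possible selection groups for one multiplicand.
for hi in (0, 1):
    for mid in (0, 1):
        for lo in (0, 1):
            print((hi, mid, lo), booth_mux((hi, mid, lo), 5))
```

In this model the groups 000 and 111 both select 0 MD, which is why no dedicated hardware is needed for that vector.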
Next, in a step 210, the PPs are summed at the adder 140. As shown, the processor 100 executes two SP mantissa operations by placing the two 24-bit SP multipliers MRSP0 and MRSP1 and two 24-bit multiplicands MDSP0 and MDSP1 in the 53-bit double precision format. Accordingly, two SP multiplication operations are performed simultaneously using a DP design.
A benefit of the present invention is that it accommodates multiple data formats, i.e., both DP and SP operations. Both DP and SP operations can be performed in a single piece of DP hardware. Furthermore, because only a single piece of DP hardware is used, only one clock is required for both the DP and SP operations.
Although the present invention is described in the context of two SP multiplier-multiplicand operand pairs MRSP0 and MDSP0, and MRSP1 and MDSP1, one of ordinary skill in the art will readily recognize that there could be any number of SP multiplier-multiplicand operand pairs (e.g. 1, 3, or more), and their use would be within the spirit and scope of the present invention.
Each group is associated with one booth mux. Accordingly, there are 27 groups 302 [00-26] and 27 corresponding booth muxes 130 [00-26]. The bits of each group are used as selection data for selecting an appropriate PP vector at the respective booth mux 130 [00-26].
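The grouping described above can be modeled as follows. This is a minimal sketch assuming standard radix-4 modified Booth grouping of an unsigned 53-bit multiplier: an implicit 0 is appended below bit 0, zeros are assumed above bit 52, and groups overlap by one bit. Groups are numbered here from the least significant end, which may differ from the numbering 302 [00-26] used in the figures; the function names are hypothetical.

```python
WIDTH = 53        # DP mantissa width stated in the text
NUM_GROUPS = 27   # number of groups and booth muxes stated in the text

def booth_groups(multiplier: int) -> list:
    """Split a 53-bit multiplier into 27 overlapping 3-bit groups, LSB first."""
    assert 0 <= multiplier < (1 << WIDTH)
    padded = multiplier << 1  # implicit 0 below bit 0 (assumed convention)
    return [(padded >> (2 * i)) & 0b111 for i in range(NUM_GROUPS)]

def group_to_digit(group: int) -> int:
    """Convert one 3-bit group to a Booth digit in {-2, ..., +2}."""
    hi, mid, lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
    return lo + mid - 2 * hi

def recode(multiplier: int) -> list:
    """Radix-4 Booth digits of the multiplier, least significant digit first."""
    return [group_to_digit(g) for g in booth_groups(multiplier)]

def value_of(digits: list) -> int:
    """Reconstruct the multiplier from its digits (digit i has weight 4**i)."""
    return sum(d * 4 ** i for i, d in enumerate(digits))

# Example: a multiplier packed as {MRSP0, 4'b0000, MRSP1, 1'b0}.
packed = (0x800001 << 29) | (0xC00000 << 1)
digits = recode(packed)
print(digits)
print(value_of(digits) == packed)
```

Under these assumptions, the zero bits between the two SP multipliers make the group at index 13 evaluate to 0 for every operand pair, so the two multipliers are recoded independently of each other.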
The PP unit 400 also includes registers 422, 424, and 426, AND gates 430 and 432, OR gates 434 and 436, and logic 440. The combination of these elements also functions to generate PP vectors (i.e., −1 MD and −2 MD) for the booth muxes 130 [14-25]. Note that elements to generate a PP vector 0 MD are not shown since the value would effectively be “0” if selected. Accordingly, the PP unit 400 generates modified 53-bit PP vectors (i.e., +2 MD, −2 MD, +1 MD, −1 MD, and 0 MD), one of which is selected at the respective booth mux 130 [14-25] for processing/compression in the Wallace tree adder 140.
Referring to the register 402, 53-bit data for the multiplicand of the SP operation is formed by concatenating the 24-bit multiplicand MDSP0, a 2-bit multiplicand shift (2'b00), the 24-bit multiplicand MDSP1, and a 3-bit multiplicand shift (3'b000). Accordingly, there is a total of 53 bits. These 53 bits and a DP status signal are inputted into the AND gate 410. The combination of a 1-bit shift of the multiplier MRSP1 and a 3-bit shift of the multiplicand MDSP1 provides a total 4-bit shift. The primary reason behind the extra 4-bit left shift of the multiplicand MDSP1 is to align the product binary points. This eases the leading zero anticipator (LZA) design for an SP operation in a DP pipeline.
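The two 53-bit layouts can be checked with a few lines of arithmetic. The sketch below assumes the concatenations are written most significant field first, as in Verilog-style notation, so that MRSP0 and MDSP0 occupy the upper bits; the function and constant names are hypothetical.

```python
SP_BITS = 24   # SP mantissa width stated in the text
DP_BITS = 53   # DP mantissa width stated in the text

def pack_multiplier(mrsp0: int, mrsp1: int) -> int:
    """{MRSP0[23:0], 4'b0000, MRSP1[23:0], 1'b0}, MSB-first order assumed."""
    return (mrsp0 << (4 + SP_BITS + 1)) | (mrsp1 << 1)

def pack_multiplicand(mdsp0: int, mdsp1: int) -> int:
    """{MDSP0[23:0], 2'b00, MDSP1[23:0], 3'b000}, MSB-first order assumed."""
    return (mdsp0 << (2 + SP_BITS + 3)) | (mdsp1 << 3)

# 1-bit multiplier shift plus 3-bit multiplicand shift gives the 4-bit total.
SP1_PRODUCT_SHIFT = 1 + 3

mr = pack_multiplier(0xFFFFFF, 0xFFFFFF)
md = pack_multiplicand(0xFFFFFF, 0xFFFFFF)
print(mr.bit_length(), md.bit_length())   # both fit exactly in DP_BITS
print(format(mr, "053b"))
print(format(md, "053b"))
```

With this layout both upper operands start at bit 29 of their respective words, and the lower product MRSP1 × MDSP1 lands 4 bits above the least significant bit of the summed result.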
In accordance with the present invention, one of the two multiplicands MDSP0 or MDSP1 is forced to zero and the other of the two multiplicands MDSP0 or MDSP1 is latched as an intermediate value. Accordingly, referring to the register 404, the multiplicand MDSP0 is forced to zero and the other multiplicand MDSP1 is latched in the register 404. The result is 1-bit shifted and latched in the register 406. The resulting +1 MD PP vector 420 and the +2 MD PP vector 422 are shown.
When generating a −1 MD PP vector and a −2 MD PP vector, the PP unit 400 operates as it does when generating a +1 MD PP vector or a +2 MD PP vector, except that the value of the 53-bit multiplicand MD (combined MDSP0 and MDSP1) in the register 422 is the inverse of the 53-bit multiplicand MD in the register 402. The resulting −1 MD PP vector 440 and the −2 MD PP vector 442 are shown.
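The role of the inverted multiplicand can be illustrated with two's-complement arithmetic. The description states only that the register 422 holds the inverse of MD; it does not say where the +1 that completes a two's-complement negation is added. The sketch below assumes that the +1 is supplied separately (in many Booth multipliers it is injected as a correction bit in the adder tree), and all names are hypothetical.

```python
WIDTH = 53
MASK = (1 << WIDTH) - 1

def invert(md: int) -> int:
    """Bitwise inverse of a 53-bit value, as held in the second register."""
    return ~md & MASK

def negate(md: int) -> int:
    """Two's-complement negation: inverse plus an assumed +1 correction."""
    return (invert(md) + 1) & MASK

md = (0x800000 << 3)            # example multiplicand field, MDSP0 forced to zero
print(format(invert(md), "053b"))
print((md + negate(md)) & MASK)  # a value plus its negation is 0 modulo 2**53
```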
Accordingly, the PP vectors are appropriately negated/shifted and can then be fed to the booth muxes for selection. The desired multiplications in a SIMD operation are MRSP0 × MDSP0 and MRSP1 × MDSP1. The additional logic 420 and 440 prevents multiplication of the operands MRSP0 and MDSP1 and prevents multiplication of the operands MRSP1 and MDSP0. The formatting for the multiplicands MDSP0 and MDSP1, as well as the formatting for the multipliers MRSP0 and MRSP1, enables a common (i.e. single) custom DP circuit to be used for the dynamic table logic for the two SP operands.
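The effect of suppressing the cross products can be modeled arithmetically. The sketch below is an assumption-laden behavioral model, not the disclosed circuit: it uses textbook radix-4 Booth recoding, places MRSP0 and MDSP0 in the upper bits of their 53-bit words, numbers the Booth digits from the least significant end (which may differ from the figure numbering), and forces the unwanted multiplicand field to zero per digit. The names `recode` and `simd_sp_multiply` are hypothetical.

```python
PRODUCT_MASK = (1 << 48) - 1   # a 24-bit by 24-bit product fits in 48 bits

def recode(multiplier: int, num_digits: int = 27) -> list:
    """Textbook radix-4 Booth digits, least significant digit first."""
    padded = multiplier << 1   # implicit 0 below bit 0
    digits = []
    for i in range(num_digits):
        g = (padded >> (2 * i)) & 0b111
        digits.append((g & 1) + ((g >> 1) & 1) - 2 * ((g >> 2) & 1))
    return digits

def simd_sp_multiply(mrsp0: int, mdsp0: int, mrsp1: int, mdsp1: int) -> tuple:
    """Return (MRSP0*MDSP0, MRSP1*MDSP1) from one pass over 27 PPs."""
    mr = (mrsp0 << 29) | (mrsp1 << 1)   # {MRSP0, 4'b0000, MRSP1, 1'b0}
    md_for_sp1 = mdsp1 << 3             # MDSP0 field forced to zero
    md_for_sp0 = mdsp0 << 29            # MDSP1 field forced to zero
    total = 0
    for i, d in enumerate(recode(mr)):
        # Digits 0-12 come from MRSP1, digits 14-26 from MRSP0 (LSB-first
        # numbering assumed); digit 13 is always 0 for this layout.
        md = md_for_sp1 if i <= 12 else md_for_sp0
        total += (d * md) << (2 * i)
    p1 = (total >> 4) & PRODUCT_MASK    # 1-bit + 3-bit shifts
    p0 = (total >> 58) & PRODUCT_MASK   # 29-bit + 29-bit shifts
    return p0, p1

print(simd_sp_multiply(3, 5, 7, 11))
```

Under this layout, an unmasked product of the two packed words would also contain MRSP1 × MDSP0 at bit offset 30 and MRSP0 × MDSP1 at bit offset 32, both of which overlap the field holding MRSP1 × MDSP1; forcing one multiplicand field to zero per PP removes those terms.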
Referring to both
There is additional logic (not shown) to generate the sign extension bits in the new positions for the PPs. Also, the LSB of the SP0 PP vectors feeding into the booth mux 130 [12] needs adjustment for DP/SP. Note that no carry-out may propagate from the right side to the left side; otherwise, the SP0 PPs would be corrupted. The filler bit is at bit number 52 for the SP0 PPs and at bit number 106 for the SP1 PPs (numbering from 0-160 including upper addend positions). The PP 13 is an unused position, separating the SP0 and SP1 PPs.
According to the system and method disclosed herein, the present invention provides numerous benefits. For example, it provides savings in core area, it enhances performance by reducing routing problems of operands to DP and SP pipelines, and it provides power savings since only one set of registers is clocked for both DP and SP operations.
A processor for processing SP floating-point numbers has been disclosed. The processor performs SP multiply operations using a DP design. The system includes a DP register that receives an SP multiplier and an SP multiplicand, a recoder that recodes the SP multiplier, and a plurality of partial product (PP) units that process the SP multiplicand. The processor also includes muxes corresponding with the PP units that generate PPs based on the recoded SP multiplier and the processed SP multiplicand. The processor also includes a Wallace-tree adder that sums the PPs.
The present invention has been described in accordance with the embodiments shown. One of ordinary skill in the art will readily recognize that there could be variations to the embodiments, and that any variations would be within the spirit and scope of the present invention. For example, the present invention can be implemented using hardware, software, a computer readable medium containing program instructions, or a combination thereof. Software written according to the present invention is to be either stored in some form of computer-readable medium such as memory or CD-ROM, or is to be transmitted over a network, and is to be executed by a processor. Consequently, a computer-readable medium is intended to include a computer readable signal, which may be, for example, transmitted over a network. Accordingly, many modifications may be made by one of ordinary skill in the art without departing from the spirit and scope of the appended claims.