The present disclosure is generally related to vector arithmetic reduction.
Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless computing devices, such as portable wireless telephones, personal digital assistants (PDAs), tablet computers, and paging devices that are small, lightweight, and easily carried by users. Many such computing devices include other devices that are incorporated therein. For example, a wireless telephone can also include a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such computing devices can process executable instructions, including software applications, such as a web browser application that can be used to access the Internet and multimedia applications that utilize a still or video camera and provide multimedia playback functionality.
Many such computing devices include vector processors for use in processing wireless transmissions and other activities associated with large quantities of repetitive calculations. Vector processors execute instructions that perform operations on multiple inputs that may be arranged as one-dimensional arrays or vectors. Execution of a vector instruction enables performance of a particular operation on the multiple inputs. For example, executing a conventional vector addition reduction instruction calculates a single sum value based on multiple inputs. Other operations, such as integral functions and cumulative distribution functions, may use the single sum in addition to one or more partial sums (e.g., one or more sums of fewer than all of the multiple inputs). Conventionally, multiple vector instructions are executed in order to generate and output the one or more partial sums. Executing the multiple vector instructions increases memory usage and power consumption as compared to executing a single vector addition reduction instruction to generate and output a single sum.
A method of executing a cumulative vector arithmetic reduction instruction is disclosed. The cumulative vector arithmetic reduction instruction may be executed at a processor to enable multiple progressive arithmetic operations, such as progressive addition operations, to be performed on an input vector. The input vector may include a plurality of input elements stored in a sequential order. Executing the cumulative vector arithmetic reduction instruction may result in an output vector of multiple output elements. Each output element may be based on a result of applying the arithmetic operation to a corresponding input element of the input vector and any sequentially prior input elements of the input vector. Accordingly, the multiple output values may correspond to multiple partial sums of the plurality of input elements, as well as a sum of all of the plurality of input elements. At least one of the input elements or the output elements may be masked to prevent one or more input elements from being included in the cumulative vector arithmetic reduction operation or to prevent one or more output elements from storing a cumulative vector arithmetic reduction result.
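As an illustrative, non-limiting sketch, the masking described above may be modeled in software as follows. The mask encoding is an assumption made for illustration only: a zero entry in the hypothetical `input_mask` excludes the corresponding input element from the running result, and a zero entry in the hypothetical `output_mask` leaves the corresponding output element at its previous value. The disclosure does not fix either encoding.

```python
def masked_cumulative_reduce(elements, input_mask, output_mask, previous_output):
    """Model a masked cumulative addition reduction.

    Assumed semantics (not specified by the disclosure):
      - input_mask[i] == 0  -> elements[i] is excluded from the running sum
      - output_mask[i] == 0 -> output slot i keeps its previous value
    """
    out = list(previous_output)
    running = 0
    for i, (element, use_input, write_output) in enumerate(
        zip(elements, input_mask, output_mask)
    ):
        if use_input:
            running += element
        if write_output:
            out[i] = running
    return out


result = masked_cumulative_reduce(
    elements=[1, 2, 3, 4],
    input_mask=[1, 0, 1, 1],       # second input element is masked out
    output_mask=[1, 1, 0, 1],      # third output element is not written
    previous_output=[9, 9, 9, 9],
)
print(result)  # [1, 1, 9, 8]
```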
A reduction tree may be selectively configured to execute a sectioned vector arithmetic reduction instruction based on a section grouping size of the sectioned vector arithmetic reduction instruction. The reduction tree may include a plurality of adders arranged into multiple rows. One or more adders of the multiple rows may be selectively enabled based on the section grouping size, and multiple output values may be generated by the selectively enabled adders. The multiple output values may be concurrently generated by performing arithmetic (e.g., addition) operations on one or more groups of inputs. Each group may have the section grouping size as a result of the selectively enabled adders. Accordingly, a single reduction tree may be configured to execute multiple sectioned vector arithmetic reduction instructions where each instruction has a different section grouping size.
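As an illustrative sketch of the arithmetic result only (not of the adder wiring), a sectioned addition reduction may be modeled as one sum per group of inputs. The function name `sectioned_reduce` is hypothetical. The requirement that the number of inputs be a multiple of the section grouping size is an assumption made to keep the example simple.

```python
def sectioned_reduce(elements, section_size):
    """Return one sum per consecutive group of `section_size` inputs.

    Assumes len(elements) is a multiple of section_size.
    """
    if section_size <= 0 or len(elements) % section_size != 0:
        raise ValueError("input count must be a positive multiple of section_size")
    return [
        sum(elements[start:start + section_size])
        for start in range(0, len(elements), section_size)
    ]


inputs = [1, 2, 3, 4, 5, 6, 7, 8]
print(sectioned_reduce(inputs, 2))  # [3, 7, 11, 15]
print(sectioned_reduce(inputs, 4))  # [10, 26]
print(sectioned_reduce(inputs, 8))  # [36]
```

The same inputs yield four, two, or one output depending on the section grouping size, which is the behavior a single selectively configurable reduction tree is described as supporting.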
In a particular embodiment, a method includes executing a vector instruction at a processor. The vector instruction includes a vector input that includes a plurality of elements. Executing the vector instruction includes providing a first element of the plurality of elements as a first output. Executing the vector instruction further includes performing a first arithmetic operation on the first element and a second element of the plurality of elements to provide a second output. Executing the vector instruction further includes storing the first output and the second output in an output vector.
In another particular embodiment, an apparatus includes a processor that includes a reduction tree. During execution of a vector instruction that identifies a vector input that includes a plurality of elements, the reduction tree is configured to provide a first element of the plurality of elements as a first output element. The reduction tree is further configured to perform a first arithmetic operation on the first element and a second element of the plurality of elements to provide a second output element. The reduction tree is further configured to store the first output element and the second output element in an output vector.
In another particular embodiment, an apparatus includes means for providing a first element of a plurality of elements as a first output. A vector instruction indicates a vector input that includes the plurality of elements. The apparatus further includes means for generating a second output based on the first element and a second element of the plurality of elements. The apparatus further includes means for storing the first output and the second output in an output vector.
In another particular embodiment, a non-transitory computer readable medium includes instructions that, when executed by a processor, cause the processor to provide a first element of a plurality of elements as a first output, to perform an arithmetic operation on the first element and a second element of the plurality of elements to provide a second output, and to store the first output and the second output in an output vector. The plurality of elements is included in a vector input indicated by a vector instruction.
In another particular embodiment, an apparatus includes a reduction tree that includes a plurality of inputs, a plurality of adders, and a plurality of outputs. A processor is configured to use the reduction tree during execution of a first instruction that includes a first section grouping size and execution of a second instruction that includes a second section grouping size. The reduction tree is configured to concurrently generate multiple output elements.
In another particular embodiment, a method includes receiving, at a processor, a vector instruction that includes a section grouping size. The processor includes a reduction tree. The reduction tree includes a plurality of inputs, a plurality of arithmetic operation units, and a plurality of outputs. The method further includes determining the section grouping size. The method further includes executing the vector instruction using the reduction tree to concurrently generate the plurality of outputs based on the section grouping size. The reduction tree is selectively configurable for use with multiple different section grouping sizes.
In a further particular embodiment, a method includes executing a vector instruction that includes a plurality of input elements. Executing the vector instruction includes grouping a first subset of the plurality of input elements to form a first set of input elements. Executing the vector instruction further includes grouping a second subset of the plurality of input elements to form a second set of input elements. Executing the vector instruction further includes performing a first arithmetic operation on the first set of input elements and performing a second arithmetic operation on the second set of input elements. Executing the vector instruction further includes rotating contents of an output register and, after rotating the contents of the output register, inserting first results of the first arithmetic operation and second results of the second arithmetic operation into the output register.
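As an illustrative sketch, the rotate-then-insert sequence may be modeled as follows. The rotation direction, the rotation amount (one slot per new result), and the slots that receive the new results are all assumptions chosen for illustration. The function name `reduce_rotate_insert` is hypothetical.

```python
def reduce_rotate_insert(inputs, section_size, output_register):
    """Sum each group of inputs, rotate the register, then insert the sums.

    Assumptions (not specified by the disclosure):
      - the register rotates toward higher indices by one slot per new result
      - the new results are written into the lowest-indexed slots
    """
    results = [
        sum(inputs[start:start + section_size])
        for start in range(0, len(inputs), section_size)
    ]
    count = len(results)
    if count > len(output_register):
        raise ValueError("more results than register slots")
    # Rotate: the last `count` slots wrap around to the front.
    rotated = output_register[-count:] + output_register[:-count]
    # Insert: overwrite the wrapped-around slots with the new results.
    rotated[:count] = results
    return rotated


register = [0, 0, 0, 0]
register = reduce_rotate_insert([1, 2, 3, 4], 2, register)
print(register)  # [3, 7, 0, 0]
register = reduce_rotate_insert([5, 6, 7, 8], 2, register)
print(register)  # [11, 15, 3, 7]
```

Under these assumptions, results of a prior instruction are preserved in the register while results of a later instruction are accumulated alongside them.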
One particular advantage provided by at least one of the disclosed embodiments is a reduction tree that is configured to generate multiple partial results during execution of a single cumulative vector arithmetic reduction instruction. Executing the single cumulative vector arithmetic reduction instruction may use less space in memory and may decrease power consumption as compared to executing multiple vector instructions to generate a similar output. Another particular advantage provided by at least one of the disclosed embodiments is a processor that may be configured to use a single reduction tree during execution of a first instruction having a first section grouping size and during execution of a second instruction having a second section grouping size. Using the single reduction tree may decrease chip area and power consumption of the processor as compared to using multiple reduction trees during execution of multiple instructions having different section grouping sizes.
Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.
Referring to
The plurality of elements 102 (e.g., the input vector 122) and the output vector 120 may include N elements, where N is an integer greater than one. The plurality of elements 102 may include a first element 104 (s0), a second element 106 (s1), a third element 108 (s2), and an Nth element 110 (s(N−1)). The plurality of elements 102 may be stored in a sequential order, such as “s0, s1, s2, . . . s(N−1)” where s0 is a first sequential element and s(N−1) is a last sequential element in the sequential order. Although four elements are shown, the number of elements in the plurality of elements 102 (e.g., N) may be more or fewer than four. In a particular embodiment, a vector permutation instruction is executed using the input vector 122 prior to execution of the cumulative vector arithmetic reduction instruction 101 to arrange the plurality of elements 102 in the sequential order.
Executing the cumulative vector arithmetic reduction instruction 101 may generate multiple output elements (e.g., multiple output values) that are stored in the output vector 120. The output vector 120 may have a same number of elements as the input vector 122 (e.g., N). Executing the cumulative vector arithmetic reduction instruction 101 may include providing N output elements. The N output elements may be stored in the output vector 120. For example, a first output element 112, a second output element 114, a third output element 116, and an Nth output element 118 may be stored in the output vector 120. The output elements 112-118 may be concurrently stored in the output vector 120. For example, the first output element 112 and the second output element 114 may be stored in the output vector 120 during a single execution cycle of the processor that executes the cumulative vector arithmetic reduction instruction 101.
Each output element of the multiple output elements 112-118 (e.g., the N output elements) may be based on an arithmetic operation (e.g., an addition operation) performed on one or more elements of the plurality of elements 102. After execution of the cumulative vector arithmetic reduction instruction 101 using the plurality of elements 102 ordered in the particular sequential order “s0, s1, s2, . . . s(N−1)”, the first output element 112 may equal s0, the second output element 114 may equal s0+s1, the third output element 116 may equal s0+s1+s2, and the Nth output element 118 may equal a sum of each element of the plurality of elements 102 (s0+s1+ . . . +s(N−1)). For example, execution of the cumulative vector arithmetic reduction instruction 101 may include providing (e.g., generating) the first element 104 as the first output element 112 and adding the first element 104 to the second element 106 to provide (e.g., generate) the second output element 114. The first output element 112 and the second output element 114 may be stored in different output elements of the output vector 120. Execution of the cumulative vector arithmetic reduction instruction 101 may further include adding the first element 104 and the second element 106 to the third element 108 to provide the third output element 116, and storing the third output element 116 in the output vector 120. Execution of the cumulative vector arithmetic reduction instruction 101 may further include adding each of the elements of the plurality of elements 102 to provide the Nth output element 118, and storing the Nth output element 118 in the output vector 120.
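As an illustrative sketch of the arithmetic result described above, the output elements may be modeled as a running sum over the input elements. The function name `cumulative_reduce` is hypothetical, and the sketch models only the values produced, not the concurrent hardware generation of those values.

```python
from itertools import accumulate


def cumulative_reduce(elements):
    """Return [s0, s0+s1, s0+s1+s2, ..., s0+...+s(N-1)]."""
    return list(accumulate(elements))


print(cumulative_reduce([3, 1, 4, 1]))  # [3, 4, 8, 9]
```

The first output equals the first input unchanged, and the last output equals the sum of all inputs, so a conventional single-sum reduction result is available as the final element.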
As illustrated in
Although addition operations have been described, the cumulative vector arithmetic reduction instruction 101 is not limited to performing only addition operations. For example, the cumulative vector arithmetic reduction instruction 101 may indicate one or more arithmetic operations to be performed on the plurality of elements 102. The one or more arithmetic operations may include addition operations, subtraction operations, or a combination thereof. For example, arithmetic reduction may be performed using one or more addition operations, using one or more subtraction operations, or using a combination of one or more addition operations and one or more subtraction operations. The one or more arithmetic operations may be indicated by a value in a particular field (e.g., a particular parameter), such as the fourth field 188. For example, the fourth field 188 may include a pointer to a location in memory storing an operation vector (e.g., a vector that indicates the one or more arithmetic operations) or to a register storing the operation vector. Each element of the operation vector may indicate a particular operation (e.g., an addition operation or a subtraction operation) to be performed on a corresponding element of the plurality of elements 102 during execution of the cumulative vector arithmetic reduction instruction 101. When at least one of the one or more arithmetic operations is a subtraction operation, one or more elements of the plurality of elements 102 may be complemented prior to generating the multiple output elements. For example, one or more elements of the plurality of elements 102 may be complemented based on the cumulative vector arithmetic reduction instruction 101 (e.g., based on the fourth value stored in the fourth field 188) prior to providing the first output element 112 and the second output element 114 (e.g., prior to generating the multiple output elements).
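As an illustrative sketch, an operation vector that selects addition or subtraction per element may be modeled as follows. The sixteen-bit element width, the `'+'`/`'-'` encoding of the operation vector, and the use of a two's complement to realize subtraction are assumptions for illustration. The disclosure states only that elements may be complemented when a subtraction operation is indicated.

```python
WIDTH = 16                      # assumed element width in bits
MASK = (1 << WIDTH) - 1


def twos_complement(value):
    """Negate a WIDTH-bit value (assumed form of the complement step)."""
    return (~value + 1) & MASK


def to_signed(value):
    """Interpret a WIDTH-bit pattern as a signed integer."""
    return value - (1 << WIDTH) if value & (1 << (WIDTH - 1)) else value


def cumulative_reduce_with_ops(elements, ops):
    """ops[i] is '+' or '-' and applies to elements[i] (assumed encoding)."""
    outputs = []
    running = 0
    for element, op in zip(elements, ops):
        operand = element & MASK
        if op == '-':
            # Complement first, so that only adders are needed afterwards.
            operand = twos_complement(operand)
        running = (running + operand) & MASK
        outputs.append(to_signed(running))
    return outputs


print(cumulative_reduce_with_ops([10, 3, 5, 2], ['+', '-', '+', '-']))
# [10, 7, 12, 10]
```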
During operation, the processor may receive the cumulative vector arithmetic reduction instruction 101. The processor may execute the cumulative vector arithmetic reduction instruction using the plurality of elements 102 to generate and store the multiple output elements in the output vector 120. The multiple output elements may represent multiple partial results of a cumulative vector arithmetic reduction operation.
By generating multiple partial results (e.g., the multiple output elements 112-118) during execution of a single vector instruction, the cumulative vector arithmetic reduction instruction 101 may provide storage and power consumption benefits as compared to generating the multiple partial results during execution of multiple vector instructions. For example, generating the multiple partial results during execution of the single vector instruction may use less storage in a memory or a register set and may decrease power consumption of the processor as compared to generating the multiple partial results during execution of the multiple vector instructions.
The processor 202 may include an arithmetic logic unit (ALU) 204 and control logic 210. The ALU 204 may include a reduction tree 206 and a rotation unit 208. The ALU 204 may be configured to receive the input vector 122 and to perform one or more arithmetic operations on the input vector 122 using the reduction tree 206. The reduction tree 206 may provide the output vector 120. The output vector 120 may be provided to a location identified by the vector instruction 220, such as a register or a location in memory. For example, the output vector 120 may be provided to the location based on a particular field (e.g., the second field 184 of
The ALU 204 and the reduction tree 206 may be part of an execution pipeline. For example, the processor 202 may be a pipelined vector processor including one or more pipelines. The reduction tree 206 may be included in the one or more pipelines. The reduction tree 206 may have a number of stages (e.g., a stage depth) based on a number of input elements (of the input vector 122). The number of stages of the reduction tree 206 may correspond to a base two logarithm of the number of input elements. For example, when the number of input elements is thirty-two, the reduction tree 206 may have five stages. The reduction tree 206 may include a plurality of arithmetic operation units arranged in one or more rows. Each stage of the reduction tree 206 may correspond to a row of arithmetic operation units of the reduction tree 206.
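As an illustrative sketch, the stage depth may be computed from the number of input elements as follows. The function name `stage_depth` is hypothetical. Rounding up for input counts that are not powers of two is an assumption, since the description gives only a power-of-two example.

```python
def stage_depth(num_inputs):
    """Return ceil(log2(num_inputs)) using integer arithmetic."""
    if num_inputs < 1:
        raise ValueError("num_inputs must be at least 1")
    return (num_inputs - 1).bit_length()


print(stage_depth(32))  # 5
print(stage_depth(16))  # 4
print(stage_depth(4))   # 2
```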
The control logic 210 may be configured to select (e.g., selectively enable) one or more adders of the plurality of adders of the reduction tree 206 based on the vector instruction 220 (e.g., the cumulative vector arithmetic reduction instruction 101 of
The rotation unit 208 may be configured to receive a rotation vector 280 and to selectively rotate the rotation vector 280 based on the vector instruction 220, as further described with reference to
The rotation unit 208 may be a rotator or a barrel vector shifter, as illustrative examples. The rotation vector 280 may include a plurality of prior elements (e.g., multiple elements generated as a result of execution of a prior vector instruction). The rotation vector 280 may be identified by the vector instruction 220. For example, the rotation vector 280 may be stored in a location, such as a register or a location in memory, identified by a field in the vector instruction 220. In a particular embodiment, a first location associated with the rotation vector 280 is the same as a second location associated with the output vector 120. For example, the vector instruction 220 may identify a particular register as the output vector 120, and previously stored elements (e.g., contents) of the particular register may be used as the rotation vector 280. The previously stored values at the particular register may be a result of a previous vector arithmetic reduction instruction. In another embodiment, the first location associated with the rotation vector 280 is the same as a third location associated with the input vector 122. In other embodiments, the rotation vector 280 may be identified by another value stored in another field of the vector instruction 220 (e.g., by a different value stored in a different field from the output vector 120) or may be predetermined based on an instruction name (e.g., an opcode) of the vector instruction 220.
During operation, the processor 202 may be configured to receive and execute the vector instruction 220 to perform vector arithmetic reduction (e.g., cumulative vector arithmetic reduction or sectioned vector arithmetic reduction) on the input vector 122 using the reduction tree 206. The reduction tree 206 may perform the vector arithmetic reduction on the input vector 122 to concurrently generate multiple results (e.g., during a single execution cycle of the processor 202). The multiple results generated by the reduction tree 206 may be stored in the output vector 120 during execution of the vector instruction 220.
By generating multiple partial results (e.g., the multiple results) during execution of a single vector instruction (e.g., the vector instruction 220), the system 200 may provide storage and power consumption improvements compared to other systems that generate the multiple partial results during execution of multiple vector instructions.
Referring to
Each input element of the plurality of input elements and each output element of the plurality of output elements may include one or more sub-elements. For example, the first input element 302 may include a first plurality of input sub-elements 330-336 (s0-s3), such as a first input sub-element 330 (s0), a second input sub-element 332 (s1), a third input sub-element 334 (s2), and a fourth input sub-element 336 (s3). The second input element 304 may include a second plurality of input sub-elements 338-344 (s4-s7), such as a fifth input sub-element 338 (s4), a sixth input sub-element 340 (s5), a seventh input sub-element 342 (s6), and an eighth input sub-element 344 (s7). Further, the first output element 306 may include a first plurality of output sub-elements 366-372 (d0-d3), such as a first output sub-element 366 (d0), a second output sub-element 368 (d1), a third output sub-element 370 (d2), and a fourth output sub-element 372 (d3). The second output element 308 may include a second plurality of output sub-elements 374-380 (d4-d7), such as a fifth output sub-element 374 (d4), a sixth output sub-element 376 (d5), a seventh output sub-element 378 (d6), and an eighth output sub-element 380 (d7). Each input element and output element may have the same size (e.g., the same number of bits). Additionally, each input sub-element may have the same size as each output sub-element (e.g., the same number of bits). For example, each input element (e.g., the first input element 302) and each output element may be sixty-four bits and may include four sixteen-bit sub-elements (e.g., input sub-elements 330-336). In an alternate embodiment, each of the input sub-elements 330-344 is an individual input element and each of the output sub-elements 366-380 is an individual output element, such that the input vector 122 includes a plurality of input elements 330-344 and the output vector 120 includes a plurality of output elements 366-380.
The reduction tree 300 may include a plurality of arithmetic operation units. In a particular embodiment, the plurality of arithmetic operation units may be a plurality of adders, including a first adder 320 and a second adder 321. In other embodiments, the plurality of arithmetic operation units may include subtractors or a combination of adders and subtractors. The plurality of adders may include (e.g., be arranged in) one or more rows of adders. For example, the plurality of adders may include (e.g., be arranged in) a first row 312. Although depicted as including a single row, the plurality of adders may include more than one row.
One or more adders of the plurality of adders may be selectively enabled, as described with reference to
The plurality of input elements may have an input type indicated by the cumulative vector arithmetic reduction instruction (e.g., by a value stored in the fifth field 190 of the cumulative vector arithmetic reduction instruction 101 of
For example, when the input type is sixteen-bit complex numbers, each input element 302 and 304 may be sixty-four bits, each input sub-element s0, s2, s4, and s6 may represent a sixteen-bit real number value, and each input sub-element s1, s3, s5, and s7 may represent a sixteen-bit imaginary number value. Each sixty-four bit input element may therefore be associated with two sixteen-bit complex input sub-elements (e.g., a first pair of s0 and s1, and a second pair of s2 and s3). As another example, when the input type identifies thirty-two bit complex numbers, each input element 302 and 304 may be sixty-four bits, a first pair of input sub-elements s0 and s1 and a second pair of input sub-elements s4 and s5 may represent thirty-two bit real number values, and a third pair of input sub-elements s2 and s3 and a fourth pair of input sub-elements s6 and s7 may represent thirty-two bit imaginary number values. Each sixty-four bit input element may therefore be associated with one thirty-two bit complex input sub-element (e.g., the first pair of input sub-elements s0 and s1 together with the third pair of input sub-elements s2 and s3, or the second pair of input sub-elements s4 and s5 together with the fourth pair of input sub-elements s6 and s7). In each example, the plurality of output elements may include similar types of output elements and output sub-elements as the input elements (e.g., the output elements may have a type identified by the input type).
Each adder of the plurality of adders may include multiple sub-adders. For example, the first adder 320 may include a first sub-adder 322, a second sub-adder 324, a third sub-adder 326, and a fourth sub-adder 328. In a particular embodiment, the first adder 320 is a sixty-four bit adder that is partitioned to perform four sixteen-bit addition operations (e.g., each sub-adder 322-328 represents a partition of the first adder 320). In an alternate embodiment, each sub-adder 322-328 is a sixteen-bit adder, and the first adder 320 represents a group of four sixteen-bit adders. Each adder of the plurality of adders may have a similar configuration as the first adder 320 (e.g., the second adder 321 may include four sub-adders). Although sixty-four bit adders and sixteen-bit sub-adders are described, other sizes of adders and sub-adders may be used, such as based on sizes of the input elements of the input vector 122.
Each adder may be configured to perform multiple addition operations in an interleaved manner via multiple sub-adders. For example, the first adder 320 may be configured to add the first input sub-element 330 (s0) and the fifth input sub-element 338 (s4) using the first sub-adder 322, to add the second input sub-element 332 (s1) and the sixth input sub-element 340 (s5) using the second sub-adder 324, to add the third input sub-element 334 (s2) and the seventh input sub-element 342 (s6) using the third sub-adder 326, and to add the fourth input sub-element 336 (s3) and the eighth input sub-element 344 (s7) using the fourth sub-adder 328. Thus, the reduction tree 300 may be configured to perform a cumulative vector arithmetic reduction operation using the first input element 302 and the second input element 304 on a sub-element by sub-element basis in an interleaved manner. Performing interleaved addition on a sub-element by sub-element basis may enable the reduction tree to perform addition operations on sub-elements having different data types (e.g., real numbers, imaginary numbers, or complex numbers).
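As an illustrative sketch, a sixty-four bit adder partitioned into four sixteen-bit sub-adders may be modeled on packed integer words as follows. Placing sub-element s0 in the least significant sixteen bits and wrapping each lane on overflow are assumptions for illustration. The helper names `pack`, `unpack`, and `partitioned_add` are hypothetical.

```python
SUB_BITS = 16                    # width of each sub-element
SUBS_PER_ELEMENT = 4             # four sub-elements per 64-bit element
SUB_MASK = (1 << SUB_BITS) - 1


def unpack(element):
    """Split a 64-bit element into four 16-bit sub-elements (s0 lowest, assumed)."""
    return [(element >> (SUB_BITS * i)) & SUB_MASK for i in range(SUBS_PER_ELEMENT)]


def pack(subs):
    """Combine four 16-bit sub-elements into one 64-bit element."""
    word = 0
    for i, sub in enumerate(subs):
        word |= (sub & SUB_MASK) << (SUB_BITS * i)
    return word


def partitioned_add(a, b):
    """Add lane by lane; a carry out of one lane never enters the next lane."""
    return pack([(x + y) & SUB_MASK for x, y in zip(unpack(a), unpack(b))])


first = pack([1, 2, 3, 0xFFFF])      # models s0..s3
second = pack([10, 20, 30, 1])       # models s4..s7
print(unpack(partitioned_add(first, second)))  # [11, 22, 33, 0]
```

In the example, the fourth lane overflows and wraps to zero without disturbing the other lanes, which distinguishes a partitioned addition from an ordinary sixty-four bit addition.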
Multiple adder outputs of a bottom row (e.g., the first row 312) of the plurality of adders may be provided as output elements (e.g., the output elements 306 and 308) and stored in the output vector 120. For example, each output of each sub-adder of the second adder 321 may be provided as a corresponding output sub-element of the first output element 306 and each output of each sub-adder 322-328 of the first adder 320 may be provided as a corresponding output sub-element of the second output element 308. The multiple output elements 306 and 308 (e.g., the multiple output sub-elements 366-380) may represent multiple partial results of cumulative vector arithmetic reduction.
Executing a received cumulative vector arithmetic reduction instruction may generate multiple partial results of the cumulative vector arithmetic reduction instruction having the input type identified by the cumulative vector arithmetic reduction instruction. For example, when the cumulative vector arithmetic reduction instruction is associated with (e.g., indicates) a complex number operation and the input type is sixteen-bit complex numbers (e.g., input sub-elements s0, s2, s4, and s6 represent real number values and input sub-elements s1, s3, s5, and s7 represent imaginary number values), executing the cumulative vector arithmetic reduction instruction may include generating a first real number sub-element (e.g., the first output sub-element 366 (d0)) of the first output element 306 and a first imaginary number sub-element (e.g., the second output sub-element 368 (d1)) of the first output element 306. Executing the cumulative vector arithmetic reduction instruction may further include generating a second real number sub-element (e.g., the fifth output sub-element 374 (d4)) of the second output element 308 and a second imaginary number sub-element (e.g., the sixth output sub-element 376 (d5)) of the second output element 308. Thus, when the input type identifies that the input elements 302 and 304 are complex numbers, the output elements 306 and 308 may be complex numbers.
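As an illustrative sketch of the sixteen-bit complex case, sub-elements may be modeled as a list in which even positions hold real parts and odd positions hold imaginary parts. Adding two such elements lane by lane then produces the same values as adding the corresponding complex numbers. The helper names are hypothetical, and plain Python integers stand in for sixteen-bit values.

```python
def lane_add(a, b):
    """Add two elements sub-element by sub-element."""
    return [x + y for x, y in zip(a, b)]


def as_complex_pairs(subs):
    """View (real, imaginary) sub-element pairs as complex numbers."""
    return [complex(subs[i], subs[i + 1]) for i in range(0, len(subs), 2)]


first_input = [1, 2, 3, 4]          # models s0..s3: (1+2j) and (3+4j)
second_input = [10, 20, 30, 40]     # models s4..s7: (10+20j) and (30+40j)

first_output = list(first_input)                      # models d0..d3
second_output = lane_add(first_input, second_input)   # models d4..d7

print(as_complex_pairs(first_output))   # [(1+2j), (3+4j)]
print(as_complex_pairs(second_output))  # [(11+22j), (33+44j)]
```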
During operation, the reduction tree 300 may be used to execute a received cumulative vector arithmetic reduction instruction. During execution of the cumulative vector arithmetic reduction instruction, one or more adders of the plurality of adders may be selectively enabled based on the cumulative vector arithmetic reduction instruction to generate multiple output elements including the output elements 306 and 308 (e.g., including the multiple output sub-elements 366-380 (d0-d7)). For example, the first adder 320 may be selectively enabled entirely, or at least partially (e.g., one or more of the sub-adders 322-328 may be selectively enabled based on the cumulative vector arithmetic reduction instruction). One or more outputs of the plurality of adders may be provided as the output elements 306 and 308 (e.g., the multiple output sub-elements 366-380 (d0-d7)) for storage in the output vector 120 during execution of the cumulative vector arithmetic reduction instruction.
Referring to
The input vector 122 may include the first input element 302, the second input element 304, a third input element 410, and a fourth input element 412. Each input element may include a plurality of input sub-elements. For example, the first input element 302 may include input sub-elements s0-s3, the second input element 304 may include input sub-elements s4-s7, the third input element 410 may include input sub-elements s8-s11, and the fourth input element 412 may include input sub-elements s12-s15. The output vector 120 may include four output elements. For example, the output vector 120 may include the first output element 306, the second output element 308, a third output element 422, and a fourth output element 424. Each output element may include a plurality of output sub-elements. For example, the first output element 306 may include output sub-elements d0-d3, the second output element 308 may include output sub-elements d4-d7, the third output element 422 may include output sub-elements d8-d11, and the fourth output element 424 may include output sub-elements d12-d15.
The plurality of adders may include (e.g., be arranged in) a plurality of rows, such as the first row 312 and a second row 414. Although two rows are shown, in other embodiments the plurality of adders may include more rows or fewer rows, such as based on a number of input elements in the input vector 122. Although each row 312, 414 is illustrated as having four adders, in other embodiments each row may have more than or fewer than four adders, such as based on a number of input elements in the input vector 122. Each of the adders 402-408 may include four sub-adders, as described with reference to the adders 320 and 321 of
One or more adders of the plurality of adders may be selectively enabled, as described with reference to
Adder outputs for the second row 414 may be provided as multiple output elements (e.g., the output elements 306, 308, 422, and 424) to be stored in the output vector 120. Through selective enablement, the plurality of adders may generate (e.g., provide) the plurality of output elements stored in the output vector 120. The output elements 306, 308, 422, and 424 (e.g., the output sub-elements d0-d15) may represent one or more partial results of cumulative vector arithmetic reduction. For example, the first output element 306 may be the first input element 302, the second output element 308 may be a sum of the first input element 302 and the second input element 304, the third output element 422 may be a sum of the first input element 302, the second input element 304, and the third input element 410, and the fourth output element 424 may be a sum of the first input element 302, the second input element 304, the third input element 410, and the fourth input element 412. The output elements 306, 308, 422, and 424 may be generated on a sub-element by sub-element basis, where the addition operations are performed in an interleaved manner to generate the output sub-elements d0-d15, as explained with reference to
Although
During operation, the reduction tree 400 may be used to execute a received cumulative vector arithmetic reduction instruction. During execution of the cumulative vector arithmetic reduction instruction, one or more adders of the plurality of adders may be selectively enabled based on the cumulative vector arithmetic reduction instruction to generate the multiple output elements 306, 308, 422, and 424. The multiple output elements 306, 308, 422, and 424 may be stored in the output vector 120 during execution of the cumulative vector arithmetic reduction instruction.
Referring to
The reduction tree 500 may include the plurality of input elements 502, a plurality of adders 504, and a plurality of output elements 506. Although
Each input element of the plurality of input elements 502 may have the same size. For example, each input element of the plurality of input elements 502 may be sixty-four bits. Each output element of the plurality of output elements 506 may also have the same size. For example, each output element of the plurality of output elements 506 may be sixty-four bits. In a particular embodiment, each input element may have the same size as each output element (e.g., sixty-four bits). A number of input elements may be equal to a number of output elements. For example, input vector 122 may have sixteen input elements, and the output vector 120 may have sixteen output elements. The number and size of the elements are illustrative; the input elements and output elements may have other sizes and the vectors (e.g., the input vector 122 and the output vector 120) may have other sizes (e.g., other numbers of elements) than illustrated. Although not illustrated, each input element may include multiple input sub-elements (e.g., four input sub-elements), and each output element may include four output sub-elements, as described with reference to
The plurality of adders 504 may be arranged in multiple rows of adders including a first row 512, a second row 514, a third row 516, and a fourth row 518. Although four rows of adders are illustrated, in other embodiments the reduction tree 500 may include (e.g., be arranged in) fewer than four rows or more than four rows, such as based on the number of input elements and output elements. Each adder of the plurality of adders 504 may have a same size. For example, each adder of the plurality of adders 504 may be a sixty-four bit adder. Although not shown, each adder of the plurality of adders 504 may include a plurality of sub-adders and may be configured to perform addition operations on a sub-element by sub-element basis in an interleaved manner, such as described with reference to
Each adder output may be provided to an adder in the same column on the next row and may also be routed to other adders as shown in
One or more adders of the plurality of adders 504 may be selectively enabled based on the cumulative vector arithmetic reduction instruction. For example, the one or more adders may be selectively enabled (as illustrated by the non-hatched adders of
The reduction tree 500 may be configured to concurrently generate the multiple output elements d0-d15 based on the multiple input elements s0-s15 and the cumulative vector arithmetic reduction instruction. For example, the reduction tree 500 may be configured to provide a first input element s0 as a first output element d0, to add the first input element s0 to a second input element s1 to provide a second output element d1, and to store the first output element d0 and the second output element d1 in the output vector 120. The reduction tree 500 may be configured to add the first input element s0 and the second input element s1 to a third input element s2 to provide a third output element d2. Additionally, the reduction tree 500 may be configured to generate an output element d15 by generating a sum of each input element s0-s15. Output elements d3-d14 may be generated as partial cumulative sums in a similar manner.
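The cumulative behavior described above can be modeled in software as an inclusive prefix sum. The following Python sketch is illustrative only: the function name `cumulative_reduce_rows` is hypothetical, and the row-by-row routing it uses (each row adds the value from a column a power-of-two distance to the left) is an assumed arrangement that happens to need four rows for sixteen elements; it is not asserted to be the routing of the reduction tree 500.

```python
def cumulative_reduce_rows(inputs):
    """Inclusive prefix sum computed row by row (assumed routing, not the figure's)."""
    values = list(inputs)
    n = len(values)
    stride = 1
    while stride < n:
        # One row of adders: column i adds the value from column i - stride.
        # Columns with no such neighbor act as disabled adders and pass through.
        values = [
            values[i] + values[i - stride] if i >= stride else values[i]
            for i in range(n)
        ]
        stride *= 2
    return values


# Hypothetical input values standing in for s0-s15.
s = list(range(1, 17))
d = cumulative_reduce_rows(s)
print(d)
```

In this model d[0] equals s0, d[1] equals s0+s1, and d[15] equals the sum of all sixteen inputs, matching the partial cumulative sums described for d0-d15.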
During operation, the reduction tree 500 may be used to execute a received cumulative vector arithmetic reduction instruction. During execution of the cumulative vector arithmetic reduction instruction, the reduction tree 500 may receive the plurality of input elements 502 from the input vector 122. During execution of the cumulative vector arithmetic reduction instruction, multiple adders of the plurality of adders 504 may be selectively enabled to provide (e.g., generate) the multiple output elements d0-d15, and the multiple output elements d0-d15 may be stored in the output vector 120.
Referring to
The reduction tree 600 may receive the multiple input elements, including the first input element 302 and the second input element 304, from the input vector 122. The first input element 302 may include input sub-elements s0-s3 and the second input element 304 may include input sub-elements s4-s7. The input elements and input sub-elements may have sizes indicated by the cumulative vector arithmetic reduction instruction. For example, the input elements 302 and 304 may be sixty-four bits, and the input sub-elements s0-s7 may be sixteen bits. The output vector 610 may include the first output element 306 and a second output element 608. The first output element 306 may include output sub-elements d0-d3 and the second output element 608 may include output sub-elements d4-d7. The output elements and output sub-elements may have sizes indicated by the cumulative vector arithmetic reduction instruction. For example, the output elements 306 and 608 may be sixty-four bits, and the output sub-elements d0-d7 may be sixteen bits. Although described as including two elements, the input vector 122 and the output vector 610 may include any number of elements (e.g., any number of sub-elements), and may have other sizes than sixty-four bits.
The reduction tree 600 may include a plurality of adders, including the first adder 320, the second adder 321, a third adder 618, and a fourth adder 619, that are configured to be selectively enabled based on the cumulative vector arithmetic reduction instruction to generate an output vector 610. The plurality of adders may include (e.g., be arranged in) a plurality of rows, including the first row 312, a second row 614, and a third row 616. Each adder of the plurality of adders may include a plurality of sub-adders. For example, each adder of the plurality of adders may be a sixty-four bit adder and may include four sixteen-bit sub-adders. One or more adders of the plurality of adders may be selectively enabled based on the cumulative vector arithmetic reduction instruction. For example, the first adder 320 (e.g., sub-adders 322-328) may be selectively enabled as described with reference to
The third adder 618 in the second row 614 may include a fifth sub-adder 625 configured to add an output of the first sub-adder 322 and an output of the third sub-adder 326. The third adder 618 may also include a sixth sub-adder 627 configured to add an output of the second sub-adder 324 and an output of the fourth sub-adder 328. By adding sub-adder outputs, the third adder 618 may apply arithmetic reduction to generate two reduced outputs of the sub-adders 625 and 627 based on the outputs of the sub-adders 322, 324, 326, and 328. Similarly, the fourth adder 619 of the third row 616 may apply arithmetic reduction using a seventh sub-adder 629 to generate an additional reduced value based on the outputs of the sub-adders 625 and 627. Thus, the second output element 608 may include a sixteen-bit reduction value based on the plurality of input sub-elements s0-s7, as well as other partial values. For example, the output sub-element d4 may be equal to a sum of the input sub-element s0 and the input sub-element s4, the output sub-element d5 may be equal to a sum of the input sub-element s1 and the input sub-element s5, the output sub-element d6 may be equal to a sum of the input sub-elements s0, s2, s4, and s6, and the output sub-element d7 may be equal to a sum of the input sub-elements s0-s7.
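The data flow through the three rows can be traced with a short Python sketch. The function name `second_output_element` is hypothetical, and the sketch assumes that the first-row sub-adders 322-328 add corresponding sub-elements of the two input elements (s0+s4, s1+s5, s2+s6, s3+s7), which is consistent with the values given for d4 and d5. Plain Python integers are used, so sixteen-bit overflow and saturation are not modeled.

```python
def second_output_element(s):
    """Return [d4, d5, d6, d7] for eight input sub-elements s0-s7."""
    if len(s) != 8:
        raise ValueError("expected eight input sub-elements")
    # First row (sub-adders 322, 324, 326, 328): assumed pairing s[i] + s[i + 4].
    out_322, out_324, out_326, out_328 = (s[i] + s[i + 4] for i in range(4))
    # Second row: sub-adder 625 adds the outputs of 322 and 326,
    # sub-adder 627 adds the outputs of 324 and 328.
    out_625 = out_322 + out_326
    out_627 = out_324 + out_328
    # Third row: sub-adder 629 adds the outputs of 625 and 627.
    out_629 = out_625 + out_627
    return [out_322, out_324, out_625, out_629]


# Hypothetical sub-element values for s0-s7.
d4_to_d7 = second_output_element([1, 2, 3, 4, 5, 6, 7, 8])
print(d4_to_d7)
```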
During operation, the reduction tree 600 may be used to execute the cumulative vector arithmetic reduction instruction. During execution of the cumulative vector arithmetic reduction instruction, one or more adders of the plurality of adders may be selectively enabled based on the cumulative vector arithmetic reduction instruction to generate the multiple output elements 306 and 608 (e.g., the multiple output sub-elements d0-d7) for storage in the output vector 610.
Referring to
The portion of the reduction tree 700 may include a first multiplexer 720 coupled to a first adder 712 and configured to receive the first input element 702 (s0) as a first mux input and a zero input (e.g., an input having a value equal to a logical zero) as a second mux input. Although the first adder 712 is illustrated, the portion of the reduction tree 700 may include a different arithmetic operation unit (e.g., a subtraction unit) in other embodiments. The first multiplexer 720 may be configured to receive a first control signal 744 from control logic, such as the control logic 210 of
The portion of the reduction tree 700 may include a first saturation logic circuit 730 coupled to the first adder 712 and configured to saturate an output of the first adder 712. Saturating the output of the first adder 712 may prevent the output of the first adder 712 from exceeding a maximum value or falling below a minimum value. The first saturation logic circuit 730 may be configured to output a saturated output (e.g., value) based on the output of the first adder 712. For example, the saturated output may have a value equal to the output of the first adder 712 when the output of the first adder 712 is between the minimum value and the maximum value. The saturated output may have a value of the maximum value when the output of the first adder 712 exceeds the maximum value, and the saturated output may have a value of the minimum value when the value of the output of the first adder 712 is less than the minimum value.
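As a minimal illustration of the saturation behavior described above, the following Python sketch clamps an adder output to a range. The function name `saturate` and the default bounds (those of a signed sixteen-bit value) are assumptions chosen for the example; the disclosure does not specify the minimum and maximum values.

```python
def saturate(value, min_value=-(1 << 15), max_value=(1 << 15) - 1):
    """Clamp value to [min_value, max_value]; the default bounds are an assumption."""
    if value > max_value:
        return max_value
    if value < min_value:
        return min_value
    return value


print(saturate(123), saturate(40000), saturate(-40000))
```

A value inside the range passes through unchanged, while values above or below the range are replaced by the maximum or minimum value, respectively.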
The portion of the reduction tree 700 may include a second multiplexer 724 coupled to the first saturation logic circuit 730. The second multiplexer 724 may be configured to receive the saturated output of the first saturation logic circuit 730 as a third mux input and the output of the first multiplexer 720 as a fourth mux input. The second multiplexer 724 may be configured to select between the third mux input and the fourth mux input based on a second control signal 746 to provide a mux output as the first output element 706 to be stored in the output vector. When the second control signal 746 is a particular value, the second multiplexer 724 may bypass the first adder 712 (e.g., provide the fourth mux input as the mux output). When the first adder 712 is not bypassed, the first adder 712 adds a first adder input 732 and a second adder input 734. The second adder input 734 may be a value received from an output of another adder, a zero value, or some other value. By selecting the fourth mux input, the second multiplexer 724 may bypass performing an addition operation using the first adder input 732 and the second adder input 734 and may provide the output of the first multiplexer 720 as the mux output. Thus, the control logic may be configured to bypass the first adder 712 based on the vector instruction. In an alternate embodiment, the first adder 712 may be bypassed by disabling a clock input (not shown).
Although only one input element is shown, the portion of the reduction tree 700 may operate on any number of input elements. For example, the portion of the reduction tree 700 may include additional circuitry (e.g., multiplexers, adders, saturation logic circuits, and connectors) to operate on input vectors having more than one input element. For example, the portion of the reduction tree 700 may include additional rows of adders, where each additional adder includes a corresponding first multiplexer, saturation logic circuit, and second multiplexer. The additional circuitry and adders may be controlled by additional control signals from the control logic. Thus, the portion of the reduction tree 700 may be included in each of the reduction trees 300-600 of
During execution of the vector instruction, the portion of the reduction tree 700 may be configured to receive the first input element 702 and generate the first output element 706 for storage in the output vector. The first multiplexer 720 may provide the zero input to the first adder 712 based on the first control signal 744. The first saturation logic circuit 730 may saturate the output of the first adder 712. The second multiplexer 724 may bypass the first adder 712 based on the second control signal 746.
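The multiplexer, adder, and saturation path described above can be summarized as a small behavioral model. The Python sketch below is an assumption-laden illustration: the function name `reduction_cell`, the boolean parameters standing in for the first control signal 744 and the second control signal 746, their polarities, and the signed sixteen-bit saturation bounds are all hypothetical.

```python
def reduction_cell(s0, second_adder_input, select_zero, bypass_adder,
                   min_value=-(1 << 15), max_value=(1 << 15) - 1):
    """Behavioral model of one adder with its two multiplexers and saturation logic."""
    # First multiplexer 720: choose the input element or the zero input
    # (select_zero stands in for the first control signal 744).
    first_adder_input = 0 if select_zero else s0
    # First adder 712: add the first adder input and the second adder input.
    adder_output = first_adder_input + second_adder_input
    # First saturation logic circuit 730: clamp the adder output.
    saturated = max(min_value, min(max_value, adder_output))
    # Second multiplexer 724: either bypass the adder (forward the first
    # multiplexer output) or forward the saturated sum
    # (bypass_adder stands in for the second control signal 746).
    return first_adder_input if bypass_adder else saturated


print(reduction_cell(5, 7, select_zero=False, bypass_adder=False))
print(reduction_cell(5, 7, select_zero=False, bypass_adder=True))
```

With the adder enabled the cell outputs the saturated sum; with the bypass selected it forwards the first multiplexer output, which is how an input element such as s0 can be provided unchanged as an output element.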
Referring to
The reduction tree 800 may include the plurality of input elements 802 (e.g., a plurality of input elements s0-s15), a plurality of adders 804, and a plurality of outputs (e.g., a plurality of adder outputs of a bottom row) configured to output multiple output elements 806 (d0-d15). Although
The reduction tree 800 may be configured to receive the plurality of input elements 802 (s0-s15) from an input vector 822. The reduction tree 800 may be configured to generate the multiple output elements 806 (d0-d15) to be stored in an output vector 820. The plurality of input elements 802 (s0-s15) may be ordered in a sequential order, such as “s0, s1, s2, . . . s15” where s0 is a first sequential element and s15 is a last sequential element in the sequential order. The plurality of output elements 806 (d0-d15) may be ordered in a similar sequential order, such as “d0, d1, d2, . . . d15” where d0 is a first sequential element and d15 is a last sequential element.
The reduction tree 800 may have a same number of input elements as output elements, and each input element may have a same size as each output element. For example, the input vector 822 may include sixteen sixty-four bit input elements, and the output vector 820 may include sixteen sixty-four bit output elements. Although not shown, each input element may include a plurality of sixteen-bit input sub-elements, and each output element may include a plurality of sixteen-bit output sub-elements, such as described with reference to
Although sixty-four bit elements and sixteen-bit sub-elements are described, each input element and each output element may have a size other than sixty-four bits, and each input sub-element and each output sub-element may have a size other than sixteen bits.
The plurality of adders 804 may be arranged in multiple rows of adders, as shown. The plurality of adders 804 may include (e.g., be arranged in) a first row 812, a second row 814, a third row 816, and a fourth row 818. Although four rows of adders are illustrated, the reduction tree 800 may alternately include (e.g., be arranged in) fewer than four rows or more than four rows, such as based on the number of input elements and the number of output elements. Each adder of the plurality of adders 804 may have a same size. For example, each adder of the plurality of adders 804 may be a sixty-four bit adder. Although not shown, each adder of the plurality of adders 804 may include a plurality of sub-adders and may be configured to perform addition operations on a sub-element by sub-element basis in an interleaved manner, such as described with reference to
One or more adder outputs from one or more rows of adders may be selectively routed via a plurality of paths 830-844, as shown by the dashed line paths in
The processor may include control logic, such as the control logic 210 of
By selectively enabling one or more adders of the plurality of adders 804 and selecting one or more corresponding adder inputs, the reduction tree 800 may be configured to concurrently generate the multiple output elements 806 (d0-d15) based on the plurality of input elements 802 (s0-s15) and the section grouping size included in the sectioned vector arithmetic reduction instruction (e.g., the first sectioned vector arithmetic reduction instruction or the second sectioned vector arithmetic reduction instruction). For example, when the section grouping size is two, the reduction tree 800 may generate (e.g., provide) a first output element d1 equal to s0+s1, a second output element d3 equal to s2+s3, a third output element d5 equal to s4+s5, a fourth output element d7 equal to s6+s7, a fifth output element d9 equal to s8+s9, a sixth output element d11 equal to s10+s11, a seventh output element d13 equal to s12+s13, and an eighth output element d15 equal to s14+s15. When the section grouping size is four, the reduction tree 800 may generate the second output element d3 equal to s0+s1+s2+s3, the fourth output element d7 equal to s4+s5+s6+s7, the sixth output element d11 equal to s8+s9+s10+s11, and the eighth output element d15 equal to s12+s13+s14+s15. When the section grouping size is eight, the reduction tree 800 may generate the fourth output element d7 equal to s0+s1+s2+s3+s4+s5+s6+s7 and the eighth output element d15 equal to s8+s9+s10+s11+s12+s13+s14+s15. When the section grouping size is sixteen, the reduction tree 800 may generate the eighth output element d15 equal to a sum of each input element s0-s15. Thus, the reduction tree 800 may be configured to selectively enable one or more adders of the multiple rows 812-818 and select one or more corresponding adder inputs based on the section grouping size to concurrently generate the multiple output elements 806.
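One way to picture how a single four-row tree can serve several section grouping sizes is to enable only as many rows as the grouping size requires. The Python sketch below is a hypothetical model, not the routing of the reduction tree 800: the function name `sectioned_reduce_tree`, the restriction to power-of-two grouping sizes, and the choice of which columns add in each row are assumptions made for illustration.

```python
def sectioned_reduce_tree(inputs, group_size, rows=4):
    """Return {output index: group sum} using an assumed binary tree of adder rows."""
    n = len(inputs)
    is_power_of_two = group_size >= 1 and (group_size & (group_size - 1)) == 0
    if not is_power_of_two or group_size > (1 << rows) or n % group_size:
        raise ValueError("group_size must be a power of two that divides len(inputs)")
    values = list(inputs)
    for row in range(rows):
        stride = 1 << row
        if stride >= group_size:
            continue  # Row disabled for this grouping size: values pass through.
        # Enabled adders in this row: the last column of every block of 2 * stride.
        for i in range(2 * stride - 1, n, 2 * stride):
            values[i] += values[i - stride]
    # Each group sum lands at the index of the last element of its group.
    return {i: values[i] for i in range(group_size - 1, n, group_size)}


# Hypothetical input values standing in for s0-s15.
s = list(range(16))
by_two = sectioned_reduce_tree(s, 2)
by_four = sectioned_reduce_tree(s, 4)
by_sixteen = sectioned_reduce_tree(s, 16)
print(by_four)
```

For a grouping size of four, only the first two rows add, and the results appear at indices 3, 7, 11, and 15, corresponding to d3, d7, d11, and d15 in the description.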
During operation, the reduction tree 800 may be used to execute the sectioned vector arithmetic reduction instruction. During execution of the sectioned vector arithmetic reduction instruction, the reduction tree 800 may receive the plurality of input elements 802 (s0-s15) from the input vector 822. For example, the plurality of input elements 802 (s0-s15) may be grouped into one or more first groups having a first section grouping size during execution of a first sectioned vector arithmetic reduction instruction and into one or more second groups having a second section grouping size during execution of a second sectioned vector arithmetic reduction instruction. During execution of the sectioned vector arithmetic reduction instruction, one or more adders of the plurality of adders 804 may be selectively enabled to generate the multiple output elements 806 (d0-d15) using the plurality of outputs (e.g., the plurality of adder outputs of the fourth row 818), and the multiple output elements 806 (d0-d15) may be stored in the output vector 820.
The reduction tree 800 enables execution of the first sectioned vector arithmetic reduction instruction having the first section grouping size and the second sectioned vector arithmetic reduction instruction having the second section grouping size using a single reduction tree. Using the single reduction tree may enable reduced device size and power consumption as compared to a processor that includes multiple reduction trees for use during execution of multiple instructions having different section grouping sizes.
Referring to
The multiple output elements 924 may be based on the sectioned vector arithmetic reduction instruction 901. For example, executing the sectioned vector arithmetic reduction instruction 901 may generate a particular output element by adding a particular input element of the plurality of input elements 902 to one or more other input elements of the plurality of input elements 902 based on a section grouping size of the sectioned vector arithmetic reduction instruction 901.
The input register 910 may include the plurality of input elements 902. For example, the plurality of input elements 902 (e.g., the input vector) may include N elements, where N is an integer greater than one. The plurality of input elements 902 may include input elements s0-s(N−1). The plurality of input elements 902 may be stored in a sequential order, such as “s0, s1, s2, . . . s(N−1)” where s0 is a first sequential input element and s(N−1) is a last sequential input element. Although five input elements are shown, the number of input elements in the plurality of input elements 902 (e.g., N) may be greater than five or less than five.
Before execution of the sectioned vector arithmetic reduction instruction 901, the output register 920 may include multiple prior elements 922. The multiple prior elements 922 may include prior elements d0-d(N−1). The multiple prior elements 922 may be included in another vector, such as the rotation vector 280 of
The process 900 illustrates execution of the sectioned vector arithmetic reduction instruction 901 having an illustrative section grouping size of two. Executing the sectioned vector arithmetic reduction instruction 901 may include grouping the plurality of input elements 902 into multiple groups, such as a first set of input elements 904 and a second set of input elements 906. A first arithmetic (e.g., addition) operation may be performed on the first set of input elements 904 to generate a first result equal to s0+s1, and a second arithmetic (e.g., addition) operation may be performed on the second set of input elements 906 to generate a second result equal to s2+s3. The first result (s0+s1) may be inserted into a first output element 916 of the output register 920 and the second result (s2+s3) may be inserted into a second output element 918 of the output register 920. When a number of results generated is less than the number of output elements in the output register 920, one or more prior elements of the multiple prior elements 922 may remain (e.g., may not be overwritten) in the output register 920. For example, when the first result and the second result are inserted into the first output element 916 and the second output element 918, the multiple output elements 924 may still include the prior elements d0 and d2. The plurality of input elements 902 may be grouped into different sets of input elements and different results may be generated when the section grouping size of the sectioned vector arithmetic reduction instruction 901 is a different size.
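The effect on the output register can be illustrated with a short Python sketch. The function name `sectioned_reduce_into` is hypothetical, and the sketch assumes that each result is written to the output position of the last element of its group (so that, for a grouping size of two, d0 and d2 are left untouched, as in the example above).

```python
def sectioned_reduce_into(prior, inputs, group_size):
    """Insert per-group sums into a copy of the prior output register contents."""
    if len(prior) != len(inputs) or group_size < 1 or len(inputs) % group_size:
        raise ValueError("register lengths must match and be a multiple of group_size")
    output = list(prior)  # Prior elements remain unless overwritten by a result.
    for start in range(0, len(inputs), group_size):
        last = start + group_size - 1
        output[last] = sum(inputs[start:start + group_size])
    return output


# Hypothetical register contents: prior elements d0-d3 and input elements s0-s3.
prior_elements = [100, 200, 300, 400]
input_elements = [1, 2, 3, 4]
result = sectioned_reduce_into(prior_elements, input_elements, 2)
print(result)
```

Here the positions of d1 and d3 receive s0+s1 and s2+s3, while the prior values in the positions of d0 and d2 are preserved.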
As illustrated in
Although addition operations have been described, the sectioned vector arithmetic reduction instruction 901 is not limited to performing only addition operations. For example, the sectioned vector arithmetic reduction instruction 901 may indicate one or more arithmetic operations to be performed on the plurality of input elements 902. The one or more arithmetic operations may include addition operations and subtraction operations. The one or more arithmetic operations may be indicated by a value in a particular field (e.g., a particular parameter), such as the fourth field 988. For example, the fourth field 988 may include a pointer to a location in memory storing an operation vector (e.g., a vector that indicates the one or more arithmetic operations) or to a register storing the operation vector. Each element of the operation vector may indicate a particular operation (e.g., an addition operation or a subtraction operation) to be performed on a corresponding element of the plurality of input elements 902 during execution of the sectioned vector arithmetic reduction instruction 901. For example, executing the sectioned vector arithmetic reduction instruction may include grouping the plurality of input elements 902 into one or more input groups based on the section grouping size and performing one or more arithmetic operations on the one or more input groups to generate the multiple output elements 924. When at least one of the one or more arithmetic operations is a subtraction operation, one or more elements of the plurality of input elements 902 may be complemented prior to generating the multiple output elements 924.
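A per-element operation vector can be modeled by negating each input element marked for subtraction before its group is summed. In the Python sketch below, the function name `sectioned_reduce_with_ops` and the encoding of the operation vector as "+" and "-" strings are hypothetical; arithmetic negation stands in for the complementing of input elements mentioned above.

```python
def sectioned_reduce_with_ops(inputs, ops, group_size):
    """Return one result per group, adding or subtracting each element per ops."""
    if len(inputs) != len(ops) or group_size < 1 or len(inputs) % group_size:
        raise ValueError("inputs and ops must match and be a multiple of group_size")
    if any(op not in ("+", "-") for op in ops):
        raise ValueError("each operation must be '+' or '-'")
    # Negate the elements marked for subtraction, then reduce by addition.
    signed = [x if op == "+" else -x for x, op in zip(inputs, ops)]
    return [sum(signed[i:i + group_size]) for i in range(0, len(signed), group_size)]


# Hypothetical values: compute s0-s1 and s2-s3 with a grouping size of two.
results = sectioned_reduce_with_ops([5, 3, 10, 4], ["+", "-", "+", "-"], 2)
print(results)
```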
During operation, the processor may receive the sectioned vector arithmetic reduction instruction 901. The processor may execute the sectioned vector arithmetic reduction instruction 901 using the plurality of input elements 902 to generate and store the multiple output elements 924 in the output register 920. The multiple output elements 924 may represent results based on the plurality of input elements 902 being grouped into one or more groups of input elements based on the section grouping size of the sectioned vector arithmetic reduction instruction 901.
By generating the multiple output elements 924 based on the section grouping size of the sectioned vector arithmetic reduction instruction 901, the sectioned vector arithmetic reduction instruction 901 enables execution of multiple sectioned vector arithmetic reduction instructions having different section grouping sizes using a single reduction tree. Using the single reduction tree may enable reduced device size and power consumption as compared to a processor that includes multiple reduction trees for use during execution of multiple instructions having different section grouping sizes.
Referring to
The rotate sectioned vector arithmetic reduction instruction 1001 may include an instruction name 1080 (e.g., an opcode), depicted as the name vraddw. The rotate sectioned vector arithmetic reduction instruction 1001 may also include a first field 1082 (Vu), a second field 1084 (Vd), a third field 1086 (Q), a fourth field 1088 (Op), a fifth field 1090 (s2), a sixth field 1092 (sc32), a seventh field 1094 (sat), and an eighth field 1096 (rot). Although eight fields are illustrated, the rotate sectioned vector arithmetic reduction instruction 1001 may include more fields or fewer fields. The fields 1082-1094 may correspond to the fields of the sectioned vector arithmetic reduction instruction 901 of
Execution of the rotate sectioned vector arithmetic reduction instruction 1001 may proceed according to the execution of the sectioned vector arithmetic reduction instruction 901 with the addition of a rotation step. For example, execution of the rotate sectioned vector arithmetic reduction instruction 1001 may include determining whether to rotate the plurality of prior elements 922 in the output register 920 prior to generating the results of the arithmetic operations. Responsive to a first determination that the plurality of prior elements 922 is to be rotated (e.g., based on the value stored in the eighth field 1096), the plurality of prior elements 922 (e.g., contents) in the output register 920 may be rotated by a rotation amount indicated by the eighth field 1096. For example, when the rotation amount is sixty-four bits and the direction is to the right, the plurality of prior elements 922 may be rotated by one prior element to the right. Thus, during execution of the rotate sectioned vector arithmetic reduction instruction 1001 (e.g., prior to generating and storing the results in the output register 920), a first sequential element of the output register 920 may store d(N−1), a second sequential element of the output register 920 may store d(0), a third sequential element of the output register 920 may store d(1), and a last sequential element of the output register 920 may store d(N−2). As another example, when the direction is to the left, the plurality of prior elements 922 may be rotated to the left by the rotation amount. Responsive to a second determination that the plurality of prior elements 922 is not to be rotated (e.g., based on the value stored in the eighth field 1096), the plurality of prior elements 922 may be maintained in a prior sequential order (e.g., d(0) . . . d(N−1)).
For example, the plurality of prior elements 922 may not be rotated when the value stored in the eighth field 1096 is a zero value or a null value (e.g., when the eighth field 1096 is not included in the rotate sectioned vector arithmetic reduction instruction 1001). Thus, the plurality of prior elements 922 may be selectively (e.g., optionally) rotated based on the rotate sectioned vector arithmetic reduction instruction 1001.
Executing the rotate sectioned vector arithmetic reduction instruction 1001 may also include determining whether to overwrite the plurality of prior elements 922. For example, each element of the plurality of prior elements 922 that is not replaced by the results of the arithmetic operations may be set to a zero value (e.g., overwritten) based on the rotate sectioned vector arithmetic reduction instruction 1001 (e.g., based on the value stored in the ninth field). A particular prior element may be set to the zero value by a corresponding adder in the reduction tree receiving the zero value for both inputs, as illustrated by the adder beneath input element s0 in the first row of adders 812 of
After the plurality of prior elements 922 in the output register 920 have been rotated, the arithmetic operation results may be generated based on the plurality of input elements 902 and inserted into the output register 920. Execution of the rotate sectioned vector arithmetic reduction instruction 1001 may include grouping the plurality of input elements 902 into multiple groups, such as the first set of input elements 904 and the second set of input elements 906. A first arithmetic (e.g., addition) operation may be performed on the first set of input elements 904 to generate a first result s0+s1, and a second arithmetic (e.g., addition) operation may be performed on the second set of input elements 906 to generate a second result s2+s3. The first result (s0+s1) may be inserted into a first output element 1016 of the output register 920 and the second result (s2+s3) may be inserted into a second output element 1018 of the output register 920. The first output element 1016 and the second output element 1018 may be different output elements of the output register 920.
A first number of input elements of the first set of input elements 904 and a second number of input elements of the second set of input elements 906 may be based on a section grouping size identified by the rotate sectioned vector arithmetic reduction instruction 1001. For example, the first number of elements and the second number of elements may be the same. When a number of results generated is less than the number of output elements in the output register 920, one or more rotated prior elements of the plurality of prior elements 922 (or one or more zero values when the plurality of prior elements 922 are overwritten prior to generating the results) may remain (e.g., may not be overwritten) in the output register 920. For example, when the first result and the second result are inserted into the first output element 1016 and the second output element 1018, the plurality of output elements 1024 may still include the rotated prior elements d(N−1) and d1. The plurality of input elements 902 may be grouped into different sets of input elements and different results may be generated when the section grouping size of the rotate sectioned vector arithmetic reduction instruction 1001 is a different size.
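The rotate-then-insert sequence can be sketched in Python as follows. The function name `rotate_sectioned_reduce` and its parameters are hypothetical. As simplifying assumptions, the rotation amount is expressed in whole elements rather than bits (a positive value rotates right, a negative value rotates left), the optional zeroing of prior elements is controlled by a boolean, and each result is written to the position of the last element of its group.

```python
def rotate_sectioned_reduce(prior, inputs, group_size, rotate_by=0, zero_prior=False):
    """Rotate (or zero) the prior register contents, then insert per-group sums."""
    n = len(prior)
    if n == 0 or n != len(inputs) or group_size < 1 or n % group_size:
        raise ValueError("register lengths must match and be a multiple of group_size")
    if zero_prior:
        output = [0] * n  # Prior elements not replaced by results become zero.
    else:
        k = rotate_by % n  # Right rotation by k elements; negative values rotate left.
        output = list(prior[-k:]) + list(prior[:-k]) if k else list(prior)
    for start in range(0, n, group_size):
        last = start + group_size - 1
        output[last] = sum(inputs[start:start + group_size])
    return output


# Hypothetical contents: prior elements d0-d3 and input elements s0-s3.
rotated_result = rotate_sectioned_reduce([10, 20, 30, 40], [1, 2, 3, 4], 2, rotate_by=1)
print(rotated_result)
```

With a right rotation of one element and a grouping size of two, the first position holds the rotated prior element d(N−1), the third position holds the rotated prior element d1, and the remaining positions hold s0+s1 and s2+s3.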
During operation, the processor may receive the rotate sectioned vector arithmetic reduction instruction 1001. The processor may execute the rotate sectioned vector arithmetic reduction instruction 1001 using the plurality of input elements 902 to generate and store the multiple output elements 1024 in the output register 920. Contents (e.g., the plurality of prior elements 922) of the output register may be selectively rotated based on the rotate sectioned vector arithmetic reduction instruction 1001, and results may be generated based on the plurality of input elements 902 being grouped into one or more groups of input elements based on the section grouping size and may be inserted into the output register 920.
Referring to
During execution of the cumulative vector arithmetic reduction instruction, the mask 1130 may be applied to the plurality of elements 102 prior to providing the first element 104 as the first output element 112. Applying the mask 1130 may include providing a zero value for a particular element of the plurality of elements 102 conditioned upon a corresponding mask value of the mask 1130. As shown, the input vector 122 includes the elements s0, s1, s2, and s(N−1) prior to application of the mask 1130 to the plurality of elements 102. After applying the mask 1130, the plurality of elements 102 includes s0, zero (provided in place of s1, based on the corresponding element of the mask 1130 being equal to zero), s2, and s(N−1). In another embodiment, applying the mask 1130 to the plurality of elements may include modifying a value of one or more elements of the plurality of elements 102 in the input vector 122. After applying the mask 1130 to the plurality of elements 102, execution of the cumulative vector arithmetic reduction instruction may proceed as explained with reference to
Referring to
During execution of the cumulative vector arithmetic reduction instruction, the mask 1130 may be applied to the output vector 120 to generate a masked output vector 1126. Applying the mask 1130 as shown may result in the masked output vector 1126 having elements s0, zero, s0+s1+s2, and s0+s1+s2+ . . . +s(N−1). Although
Additionally, the masking shown in
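As an illustrative, non-limiting sketch, the two masking variants described above (applying the mask to the input elements before the reduction, and applying the mask to the output vector after the reduction) may be modeled in software as follows. The function names and the use of Python lists to stand in for vector registers are assumptions made for illustration only and do not describe the hardware implementation.

```python
from itertools import accumulate


def apply_mask(elements, mask):
    """Replace each element whose mask value is zero with a zero value."""
    return [element if bit else 0 for element, bit in zip(elements, mask)]


def masked_input_reduction(elements, mask):
    """Mask the input elements first, then form the cumulative sums."""
    return list(accumulate(apply_mask(elements, mask)))


def masked_output_reduction(elements, mask):
    """Form the cumulative sums first, then mask the output vector."""
    return apply_mask(list(accumulate(elements)), mask)


inputs = [1, 2, 3, 4]   # stands in for s0, s1, s2, s(N-1)
mask = [1, 0, 1, 1]     # second mask value is zero

print(masked_input_reduction(inputs, mask))   # [1, 1, 4, 8]
print(masked_output_reduction(inputs, mask))  # [1, 0, 6, 10]
```

In the second result, the masked position holds zero while the later partial sums still include the value of the second input, which corresponds to the masked output vector having elements s0, zero, s0+s1+s2, and so on.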
Referring to
A vector instruction may be executed at the processor at 1202. The vector instruction may be the cumulative vector arithmetic reduction instruction 101 of
A first input element of the plurality of input elements may be provided as a first output element, at 1204. The first input element may be the first element 104 (s0) of
A first arithmetic operation may be performed on the first input element and a second input element of the plurality of input elements, at 1206, to provide (e.g., generate) a second output element. For example, the first arithmetic operation may be an addition operation. In other embodiments, the first arithmetic operation may be a subtraction operation. The second input element may be the second element 106 (s1) of
The first output element and the second output element may be stored in an output vector, at 1208. The output vector may be the output vector 120 of
Additional output elements may be generated in this manner. For example, a second arithmetic operation may be performed on the first input element, the second input element, and a third input element of the plurality of input elements to generate (e.g., provide) a third output element. Thus, a particular output element may be generated by performing a particular arithmetic operation on a particular input element of the plurality of input elements and one or more other input elements of the plurality of input elements that are sequentially prior to the particular input element in the sequential order.
In accordance with the method 1200, multiple output elements (e.g., the first output element and the second output element) may be generated and may represent multiple partial results of cumulative vector arithmetic reduction. By generating multiple partial results during execution of a single vector instruction, the method 1200 may provide storage and power consumption improvements as compared to generating the multiple partial results during execution of multiple vector instructions.
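As an illustrative, non-limiting sketch, the behavior of the method 1200 may be modeled in software as shown below. The function name `cumulative_reduction` and its `op` parameter are hypothetical. The sketch also assumes that each output element is formed by combining the previous output element with the next input element, which yields the same values as operating on all sequentially prior input elements when the operation is addition or left-to-right subtraction.

```python
import operator


def cumulative_reduction(elements, op=operator.add):
    """Return the partial results of a cumulative reduction.

    The first input element is passed through as the first output element.
    Each later output element combines the previous output element with the
    next input element.
    """
    outputs = []
    for index, element in enumerate(elements):
        if index == 0:
            outputs.append(element)          # first input becomes first output
        else:
            outputs.append(op(outputs[-1], element))
    return outputs


print(cumulative_reduction([1, 2, 3, 4]))             # [1, 3, 6, 10]
print(cumulative_reduction([5, 1, 2], operator.sub))  # [5, 4, 2]
```

A single call returns every partial result, which mirrors generating multiple partial results during execution of a single vector instruction.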
Referring to
A vector instruction including a section grouping size may be received at the processor, at 1302. For example, the vector instruction may be the sectioned vector arithmetic reduction instruction 901 of
The section grouping size may be determined, at 1304. For example, the section grouping size may be determined based on a particular field of the vector instruction, such as the fifth field 990 of
The vector instruction may be executed using the reduction tree to concurrently generate the plurality of outputs based on the section grouping size, at 1306. For example, executing the vector instruction may include grouping the plurality of input elements into one or more groups having the section grouping size and performing one or more arithmetic operations on the one or more groups to generate the plurality of outputs. The plurality of outputs may be generated during a single processing cycle of the processor based on the vector reduction instruction.
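As an illustrative, non-limiting sketch, grouping the input elements by the section grouping size and reducing each group may be modeled in software as follows. The function name `sectioned_reduction` is hypothetical, addition is assumed as the arithmetic operation, and the sketch assumes the number of input elements is a multiple of the section grouping size. The software loop is sequential and does not model the concurrent generation of outputs by the reduction tree.

```python
def sectioned_reduction(elements, section_size):
    """Sum each consecutive group of `section_size` input elements."""
    if section_size <= 0 or len(elements) % section_size != 0:
        raise ValueError("input length must be a multiple of the section size")
    return [
        sum(elements[start:start + section_size])
        for start in range(0, len(elements), section_size)
    ]


inputs = [1, 2, 3, 4, 5, 6, 7, 8]
print(sectioned_reduction(inputs, 2))  # [3, 7, 11, 15]
print(sectioned_reduction(inputs, 4))  # [10, 26]
print(sectioned_reduction(inputs, 8))  # [36]
```

The same input elements produce a different number of outputs, with different values, for each section grouping size.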
The reduction tree may be selectively configurable for use with multiple different section grouping sizes. For example, a configuration of the reduction tree may be associated with a particular section grouping size. The configuration of the reduction tree may be associated with a particular subset of arithmetic operation units being enabled and a particular subset of arithmetic operation unit inputs being selected (e.g., a particular subset of paths being enabled), such as subsets of the plurality of adders 804 and the plurality of paths 830-844 of
In accordance with the method 1300, the reduction tree may be selectively configurable for use with multiple instructions having different section grouping sizes. Using the single reduction tree may enable reduced device size and power consumption as compared to a processor that includes multiple reduction trees for use during execution of multiple instructions having different section grouping sizes.
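As an illustrative, non-limiting sketch, one way to picture a single reduction tree that serves multiple section grouping sizes is a binary tree of pairwise adders in which the section grouping size selects how many rows of adders contribute to the outputs. The binary-tree structure, the restriction to power-of-two sizes, and the function name `reduction_tree` are assumptions made for illustration and are not the arrangement of adders and paths of the disclosed reduction tree.

```python
def is_power_of_two(value):
    """Return True when value is a positive power of two."""
    return value > 0 and value & (value - 1) == 0


def reduction_tree(elements, section_size):
    """Model a pairwise adder tree whose outputs are taken after log2(section_size) rows."""
    count = len(elements)
    if not is_power_of_two(count):
        raise ValueError("number of input elements must be a power of two")
    if not is_power_of_two(section_size) or section_size > count:
        raise ValueError("section size must be a power of two no larger than the input")

    level = list(elements)
    width = 1  # number of input elements covered by each value in `level`
    while width < section_size:
        # One row of adders: each adder sums two neighboring values.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        width *= 2
    return level


inputs = [1, 2, 3, 4, 5, 6, 7, 8]
print(reduction_tree(inputs, 2))  # [3, 7, 11, 15]  (one row of adders used)
print(reduction_tree(inputs, 4))  # [10, 26]        (two rows of adders used)
print(reduction_tree(inputs, 8))  # [36]            (three rows of adders used)
```

In this model, the first row of adders is shared by every configuration, so supporting an additional section grouping size changes only which row supplies the outputs rather than requiring a separate tree.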
Referring to
A vector instruction that includes a plurality of input elements may be executed, at 1402. For example, the vector instruction may be the rotate sectioned vector arithmetic reduction instruction 1001 and the plurality of input elements may be the plurality of input elements 902 of
A first subset of the plurality of input elements may be grouped to form a first set of input elements, at 1404. For example, the first set of input elements may be the first set of input elements 1004 of
A second subset of the plurality of input elements may be grouped to form a second set of input elements, at 1406. For example, the second set of input elements may be the second set of input elements 1006 of
A first arithmetic operation may be performed on the first set of input elements, at 1408. For example, a first addition operation may be performed on the first set of input elements. In a particular embodiment, the first arithmetic operation may be indicated by an operation vector. The operation vector may be indicated by a value stored in a particular field (e.g., a parameter) of the rotate sectioned vector arithmetic reduction instruction, such as the fourth field 1088 of the rotate sectioned vector arithmetic reduction instruction 1001 of
A second arithmetic operation may be performed on the second set of input elements, at 1410. For example, a second addition operation may be performed on the second set of input elements. In a particular embodiment, the second arithmetic operation may be indicated by the operation vector.
Contents of an output register may be rotated, at 1412. For example, the output register may be the output register 1020 of
After rotating the contents of the output register, first results of the first arithmetic operation and second results of the second arithmetic operation may be inserted into the output register, at 1414. For example, the first results may be inserted in a first output element of the output register and the second results may be inserted into a second output element of the output register. The first output element may be the first output element 1016 of
According to the method 1400, rotation and sectioned vector arithmetic reduction may be performed for multiple section grouping sizes through execution of a single vector instruction using a single reduction tree. Using the single reduction tree may enable reduced device size and power consumption as compared to a processor that includes multiple reduction trees for use during execution of multiple instructions having different section grouping sizes.
Referring to
The processor 1510 may be configured to execute computer-executable instructions 1560 (e.g., a program of one or more instructions) stored in the memory 1532 (e.g., a computer-readable storage medium). The instructions 1560 may include the cumulative vector arithmetic reduction instruction 1562 and/or the sectioned vector arithmetic reduction instruction 1564. The cumulative vector arithmetic reduction instruction 1562 may be the cumulative vector arithmetic reduction instruction 101 of
A camera interface 1568 is coupled to the processor 1510 and is also coupled to a camera, such as a video camera 1570. A display controller 1526 is coupled to the processor 1510 and to a display 1528. A coder/decoder (CODEC) 1534 may also be coupled to the processor 1510. A speaker 1536 and a microphone 1538 may be coupled to the CODEC 1534. A wireless interface 1540 may be coupled to the processor 1510 and to an antenna 1542 such that wireless data received via the antenna 1542 and the wireless interface 1540 may be provided to the processor 1510.
In a particular embodiment, the processor 1510 may be configured to execute the computer executable instructions 1560 stored at a non-transitory computer-readable medium, such as the memory 1532, that are executable to cause a computer, such as the processor 1510, to provide a first element of a plurality of elements as a first output element. The computer executable instructions 1560 may include the cumulative vector arithmetic reduction instruction 1562. The plurality of elements may be the plurality of elements 102 of
In a particular embodiment, the processor 1510 may be configured to execute the computer executable instructions 1560 stored at a non-transitory computer-readable medium, such as the memory 1532, that are executable to cause a computer, such as the processor 1510, to receive a vector instruction including a section grouping size. The vector instruction may be the sectioned vector arithmetic reduction instruction 1564. The computer executable instructions 1560 may be further executable to determine the section grouping size. The computer executable instructions 1560 may be further executable to execute the vector instruction using a reduction tree to concurrently generate a plurality of outputs based on the section grouping size. The reduction tree may include the reduction tree 206 of
In a particular embodiment, the processor 1510, the display controller 1526, the memory 1532, the CODEC 1534, the wireless interface 1540, and the camera interface 1568 are included in a system-in-package or system-on-chip device 1522. In a particular embodiment, an input device 1530 and a power supply 1544 are coupled to the system-on-chip device 1522. Moreover, in a particular embodiment, as illustrated in
The methods 1200-1400 of
In conjunction with one or more of the described embodiments, an apparatus is disclosed that may include means for providing a first element of a plurality of elements as a first output. The means for providing may include one or more adders of a reduction tree, such as the reduction tree 206 of
The apparatus may also include means for saturating the second output. The means for saturating the second output may include the first saturation logic circuit 730 or the second saturation logic circuit 732 of
In conjunction with one or more of the described embodiments, an apparatus is disclosed that may include means for concurrently generating a plurality of outputs based on a vector instruction. The means for concurrently generating may include the reduction tree 206 of
One or more of the disclosed embodiments may be implemented in a system or an apparatus, such as the device 1500, that may include a communications device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a computer, a tablet, a portable computer, or a desktop computer. Additionally, the device 1500 may include a set top box, an entertainment unit, a navigation device, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a video player, a digital video player, a digital video disc (DVD) player, a portable digital video player, any other device that stores or retrieves data or computer instructions, or a combination thereof. As another illustrative, non-limiting example, the system or the apparatus may include remote units, such as mobile phones, hand-held personal communication systems (PCS) units, portable data units such as personal data assistants, global positioning system (GPS) enabled devices, navigation devices, fixed location data units such as meter reading equipment, or any other device that stores or retrieves data or computer instructions, or any combination thereof.
Although one or more of
Those of skill in the art would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or as executing software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary non-transitory (e.g., tangible) storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description of the disclosed embodiments is provided to enable a person skilled in the art to make or use the disclosed embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.