The field of invention relates generally to computer processor architecture, and, more specifically, to methods and systems for executing vectorized Pythagorean tuple instructions.
A (e.g., hardware) processor, or set of processors, executes instructions from an instruction set, e.g., the instruction set architecture (ISA). The instruction set is the part of the computer architecture related to programming, and generally includes the native data types, instructions, register architecture, addressing modes, memory architecture, and interrupt and exception handling.
One class of mathematical operations relates to computing Pythagorean tuples, such as 2nd-order, 3rd-order, and 4th-order Pythagorean tuples. The latency encountered and the number of individual instructions used in executing Pythagorean tuple instructions can be high, reducing performance, as a minimum of N instructions sometimes need to be executed serially to calculate a Pythagorean tuple of order-N.
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.
References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Detailed herein are embodiments that execute an order-2, order-3, or order-4 Pythagorean tuple instruction with a single instruction. One class of mathematical operations relates to computing Pythagorean tuples, such as illustrated below in Equations 1, 2, and 3, for 2nd-order, 3rd-order, and 4th-order Pythagorean tuples, respectively.
r=x*x+y*y (Order-2) Equation 1
r=x*x+y*y+z*z (Order-3) Equation 2
r=w*w+x*x+y*y+z*z (Order-4) Equation 3
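Equations 1 through 3 can be summarized by a small software reference model. The sketch below is illustrative only (the function name and signature are not part of any ISA definition); it accumulates a sum of squares into an accumulator, as the hardware embodiments described below do.

```python
# Illustrative reference model of Equations 1-3. The name pyth_tuple
# and its signature are assumptions for this sketch, not ISA-defined.

def pyth_tuple(acc, *terms):
    """Accumulate the sum of squares of the terms into acc."""
    for t in terms:
        acc += t * t
    return acc

r2 = pyth_tuple(0, 3, 4)        # Order-2 (Equation 1): 9 + 16 = 25
r3 = pyth_tuple(0, 1, 2, 2)     # Order-3 (Equation 2): 1 + 4 + 4 = 9
r4 = pyth_tuple(0, 1, 1, 1, 1)  # Order-4 (Equation 3): 4
```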
Disclosed embodiments do not use conventional circuitry, and thus avoid having to issue multiple instructions or incur large latencies. Rather, disclosed embodiments utilize hardware to calculate the tuples in a single instruction.
In some embodiments, register renaming, register allocation, and/or scheduling circuitry 107 provides functionality for one or more of: 1) renaming logical operand values to physical operand values (e.g., a register alias table in some embodiments), 2) allocating status bits and flags to the decoded instruction, and 3) scheduling the decoded instruction for execution on execution circuitry out of an instruction pool (e.g., using a reservation station in some embodiments).
Registers (such as included in register architecture 800, described below) and/or memory 108 store data as operands of the instruction to be operated on by execution circuitry. Exemplary register types include packed data registers, general purpose registers, and floating point registers.
Execution circuitry 109 executes the decoded VPYTH instruction. Exemplary detailed execution circuitry is described further below. In an embodiment of a vectorized operation, the execution of the decoded VPYTH instruction causes the execution circuitry to execute the decoded instruction on each of a plurality of corresponding pairs of elements of first and second source vectors, the execution to generate a first product by multiplying the first element by itself, generate a second product by multiplying the second element by itself, and accumulate the first and second products with previous contents of the destination.
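As a minimal sketch of the vectorized order-2 semantics just described (names are illustrative, not architectural), each destination lane accumulates the squares of the corresponding source lanes:

```python
# Illustrative model of the vectorized order-2 operation: for each
# lane i, dest[i] += src1[i]**2 + src2[i]**2.

def vpyth2(dest, src1, src2):
    """Return the updated destination vector, lane by lane."""
    return [d + a * a + b * b for d, a, b in zip(dest, src1, src2)]

result = vpyth2([0, 10], [3, 1], [4, 2])  # [0+9+16, 10+1+4] = [25, 15]
```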
Write back (retirement) circuitry 111 commits the result of the execution of the decoded VPYTH instruction. Write back (retirement) circuitry 111 is optional, as indicated by its dashed border, at least insofar as it represents functionality that can occur at a different time, at a different stage of the processor's pipeline, or not at all.
Some embodiments of the execution circuitry and processor pipeline in disclosed embodiments are discussed further below with respect to
As shown, multiplier 206A generates a first product by multiplying the first source, SRC1201, by itself (sometimes referred to herein as “squaring”), and multiplier 206B generates a second product by multiplying the second source, SRC2202, by itself (a.k.a. squaring). Adder 208 accumulates the first product and the second product with the previous contents of destination 209. The resulting sum is stored in destination 209 and represents the Pythagorean tuple R+A*A+B*B.
Circuit 200 can be replicated N times to execute a vectorized order-2 Pythagorean tuple instruction in parallel on N elements of a packed data vector. Some embodiments reduce the amount of hardware required by executing the vectorized Pythagorean tuple instruction over multiple cycles. For example, the instruction can be executed over 2 cycles, reducing the required hardware by about a half, or over 4 cycles, reducing the number of instances of circuit 200 to about a quarter. In some embodiments, destination 209 is zeroed after reset so that it has an initial value.
In some embodiments, adder 208 performs saturation, if needed, at the end of the addition. In some embodiments, the data from the first and/or second sources is sign extended prior to multiplication. In some embodiments of integer versions of the instruction, saturation circuitry is used to preserve a sign of an operand when the addition results in a value that is too big. In particular, the saturation evaluation occurs on the infinite precision result in between the multi-way-add and the write to the destination. There are instances where the largest positive or least negative number cannot be trusted since it may reflect that a calculation exceeded the container space. However, this can at least be checked. In some embodiments, the sum of products and the floating point accumulator are turned into infinite precision values (fixed point numbers of hundreds of bits), the addition is performed, and then a single rounding to the actual accumulator type is performed.
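The integer saturation behavior described above can be sketched in software. The 8-bit container width below is an illustrative assumption; the squares and sum are evaluated exactly, and a single saturation occurs on the write to the destination.

```python
# Illustrative sketch of signed saturation on the exact
# (infinite-precision) result. The 8-bit container is an assumption.

BITS = 8
LO, HI = -(1 << (BITS - 1)), (1 << (BITS - 1)) - 1  # -128, 127

def saturate(value):
    """Clamp an exact result to the signed container range."""
    return max(LO, min(HI, value))

def vpyth2_sat(dest, a, b):
    # The squares and the sum are exact; saturation is applied once,
    # between the multi-way add and the write to the destination.
    return saturate(dest + a * a + b * b)

vpyth2_sat(0, 10, 10)   # exact 200 saturates to 127
vpyth2_sat(-100, 1, 1)  # exact -98, within range, unchanged
```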
In some embodiments, when the input terms are floating point operands, rounding, the handling of special values (infinities and not-a-numbers (NaNs)), and the ordering of faults in the calculation need to be resolved in the instruction definition. In some embodiments, an order of operations is specified that can be emulated and that ensures the implementation delivers faults in that order. It may be impossible for such an implementation to avoid multiple roundings in the course of the calculation. A single precision multiply can fit completely into a double precision result regardless of input values. However, the horizontal add of two such operations may not fit into a double without rounding, and the sum may not fit the accumulator without an additional rounding. In some embodiments, rounding is performed during the horizontal summation and once during the accumulation.
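The rounding concern above can be illustrated by comparing a chain of individually rounded floating point operations against exact evaluation followed by a single rounding (emulated here with rationals). This is an illustrative sketch, not the architectural definition:

```python
# Illustrative comparison of stepwise rounding vs. a single final
# rounding. float(Fraction) performs one correct rounding in CPython.
from fractions import Fraction

def stepwise_sum_of_squares(r, a, b):
    # Each float operation rounds: a*a, b*b, and each addition can
    # each introduce a rounding error.
    return (r + a * a) + b * b

def single_rounded_sum_of_squares(r, a, b):
    # Exact (infinite-precision) evaluation, then one rounding.
    exact = Fraction(r) + Fraction(a) * Fraction(a) + Fraction(b) * Fraction(b)
    return float(exact)

# For many inputs the two agree exactly; the text above explains why
# they can differ when intermediate sums do not fit without rounding.
stepwise_sum_of_squares(1.0, 2.0, 3.0)  # 14.0
```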
As shown, multiplier 216A generates a first product by multiplying the first source, SRC1211, by itself (a.k.a. squaring), multiplier 216B generates a second product by multiplying the second source, SRC2212, by itself (a.k.a. squaring), and multiplier 216C generates a third product by multiplying the third source, SRC3213, by itself (a.k.a. squaring). Adder 218 accumulates the first, second, and third products with the previous contents of destination 219.
Circuit 210 can be replicated N times to execute a vectorized order-3 Pythagorean tuple instruction in parallel on N elements of a packed data vector. Some embodiments reduce the amount of hardware required by executing the vectorized Pythagorean tuple instruction over multiple cycles. For example, the instruction can be executed over 2 cycles, reducing the required hardware by about a half, or over 4 cycles, reducing the number of instances of circuit 210 to about a quarter. In some embodiments, destination 219 is zeroed after reset.
As shown, multiplier 226A generates a first product by multiplying the first source, SRC1221, by itself (a.k.a. squaring), multiplier 226B generates a second product by multiplying the second source, SRC2222, by itself (a.k.a. squaring), multiplier 226C generates a third product by multiplying the third source, SRC3223, by itself (a.k.a. squaring), and multiplier 226D generates a fourth product by multiplying the fourth source, SRC4224, by itself (a.k.a. squaring). Adder 228 accumulates the first, second, third, and fourth products with the previous contents of destination 229.
Circuit 220 can be replicated N times to execute a vectorized order-4 Pythagorean tuple instruction in parallel on N elements of a packed data vector. Some embodiments reduce the amount of hardware required by executing the vectorized Pythagorean tuple instruction over multiple cycles. For example, the instruction can be executed over 2 cycles, reducing the required hardware by about a half, or over 4 cycles, reducing the number of instances of circuit 220 to about a quarter. In some embodiments, destination 229 is zeroed after reset.
Circuit 230 includes round FMA 236A and round FMA 236B, each of which performs a fused multiply and add, with rounding. In some embodiments, round FMA 236A and round FMA 236B comply with one or more standards promulgated by the Institute of Electrical and Electronic Engineers (IEEE), such as IEEE-754-2008. As shown, round FMA 236A generates a first product by multiplying the first source, SRC1231, by itself (sometimes referred to herein as “squaring”), and accumulates the resulting product with the previous contents of DEST 239. In turn, round FMA 236B generates a second product by multiplying the second source, SRC2232, by itself (a.k.a. squaring), and accumulates the product with the output of round FMA 236A. The resulting sum is stored in destination 239 and represents the Pythagorean tuple R+A*A+B*B.
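The round FMA behavior can be emulated in software by computing the product and sum exactly and rounding once, then chaining two such operations in the order described for circuit 230. This is an illustrative model, not the hardware implementation:

```python
# Illustrative software emulation of a round FMA (exact a*b + c,
# followed by a single correct rounding), chained as in circuit 230.
from fractions import Fraction

def fma(a, b, c):
    """Fused multiply-add: exact a*b + c, then one rounding to float."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def vpyth2_fma(r, a, b):
    # Mirrors the chain in circuit 230: FMA 236A feeds FMA 236B.
    t = fma(a, a, r)     # first squaring, accumulated with DEST
    return fma(b, b, t)  # second squaring, accumulated with FMA output

vpyth2_fma(0.0, 3.0, 4.0)  # 9.0 then 25.0
```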
Note that circuit 230 produces the same result as circuit 200 of
Circuit 230 can be replicated N times to execute a vectorized order-2 Pythagorean tuple instruction in parallel on N elements of a packed data vector. Some embodiments reduce the amount of hardware required by executing the vectorized Pythagorean tuple instruction over multiple cycles. For example, the instruction can be executed over 2 cycles, reducing the required hardware by about a half, or over 4 cycles, reducing the number of instances of circuit 230 to about a quarter. In some embodiments, destination 239 is zeroed after reset so that it has an initial value.
Circuit 240 includes round FMA 246A, round FMA 246B, and round FMA 246C, each of which performs a fused multiply and add, with rounding. In some embodiments, round FMA 246A, round FMA 246B, and round FMA 246C comply with one or more IEEE standards, such as IEEE-754-2008. As shown, round FMA 246A generates a first product by multiplying the first source, SRC1241, by itself (sometimes referred to herein as “squaring”), and accumulates the resulting product with the previous contents of DEST 249. In turn, round FMA 246B generates a second product by multiplying the second source, SRC2242, by itself (a.k.a. squaring), and accumulates the product with the output of round FMA 246A. In turn, round FMA 246C generates a third product by multiplying the third source, SRC3243, by itself (a.k.a. squaring), and accumulates the product with the output of round FMA 246B. The resulting sum is stored in destination 249 and represents the Pythagorean tuple R+A*A+B*B+C*C.
Note that circuit 240 produces the same result as circuit 210 of
Circuit 240 can be replicated N times to execute a vectorized order-3 Pythagorean tuple instruction in parallel on N elements of a packed data vector. Some embodiments reduce the amount of hardware required by executing the vectorized Pythagorean tuple instruction over multiple cycles. For example, the instruction can be executed over 2 cycles, reducing the required hardware by about a half, or over 4 cycles, reducing the number of instances of circuit 240 to about a quarter. In some embodiments, destination 249 is zeroed after reset so that it has an initial value.
Circuit 250 includes round FMA 256A, round FMA 256B, round FMA 256C, and round FMA 256D, each of which performs a fused multiply and add, with rounding. In some embodiments, round FMA 256A, round FMA 256B, round FMA 256C, and round FMA 256D comply with one or more IEEE standards, such as IEEE-754-2008. As shown, round FMA 256A generates a first product by multiplying the first source, SRC1251, by itself (sometimes referred to herein as “squaring”), and accumulates the resulting product with the previous contents of DEST 259. In turn, round FMA 256B generates a second product by multiplying the second source, SRC2252, by itself (a.k.a. squaring), and accumulates the product with the output of round FMA 256A. In turn, round FMA 256C generates a third product by multiplying the third source, SRC3253, by itself (a.k.a. squaring), and accumulates the product with the output of round FMA 256B. In turn, round FMA 256D generates a fourth product by multiplying the fourth source, SRC4254, by itself (a.k.a. squaring), and accumulates the product with the output of round FMA 256C. The resulting sum is stored in destination 259 and represents the Pythagorean tuple R+A*A+B*B+C*C+D*D.
Note that circuit 250 produces the same result as circuit 220 of
Circuit 250 can be replicated N times to execute a vectorized order-4 Pythagorean tuple instruction in parallel on N elements of a packed data vector. Some embodiments reduce the amount of hardware required by executing the vectorized Pythagorean tuple instruction over multiple cycles. For example, the instruction can be executed over 2 cycles, reducing the required hardware by about a half, or over 4 cycles, reducing the number of instances of circuit 250 to about a quarter. In some embodiments, destination 259 is zeroed after reset so that it has an initial value.
As shown, each element of the destination register is generated by a circuit including two multipliers and an accumulator. In particular, the least significant element, R0, of DESTINATION 309 is generated by generating a first product by multiplying the least significant element A0 of SOURCE 1301 by itself, generating a second product by multiplying the least significant element B0 of SOURCE 2302 by itself, and accumulating the first product and the second product with the previous value of R0. The second least significant element, R1, of DESTINATION 319 is generated by generating a first product by multiplying the second least significant element A1 of SOURCE 1311 by itself, generating a second product by multiplying the second least significant element B1 of SOURCE 2312 by itself, and accumulating the first product and the second product with the previous value of R1. The most significant element, RN, of DESTINATION 329 is generated by generating a first product by multiplying the most significant element AN of SOURCE 1321 by itself, generating a second product by multiplying the most significant element BN of SOURCE 2322 by itself, and accumulating the first product and the second product with the previous value of RN.
Each remaining element of the destination register is generated in a similar fashion. In some embodiments, the execution circuit generates each element of the destination register in parallel. In some embodiments, about half as much hardware is used by taking two cycles to update the destination register. In some embodiments, about a quarter as much hardware is used by taking four cycles to update the destination register. In some embodiments, each of the elements of the destination register is updated serially.
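The two-cycle option described above can be sketched as one bank of multiply-accumulate logic reused across two passes over the destination lanes (lane counts and names are illustrative assumptions):

```python
# Illustrative sketch of the two-cycle option: half the destination
# lanes are updated per pass, so one bank of hardware is reused.

def vpyth2_two_cycles(dest, src1, src2):
    """Update the destination in two passes of n/2 lanes each."""
    n = len(dest)
    out = list(dest)
    half = n // 2
    for lo, hi in ((0, half), (half, n)):  # "cycle 0", then "cycle 1"
        for i in range(lo, hi):
            out[i] = dest[i] + src1[i] ** 2 + src2[i] ** 2
    return out

vpyth2_two_cycles([0, 0, 0, 0], [1, 2, 3, 4], [1, 1, 1, 1])
# lanes: 0+1+1, 0+4+1, 0+9+1, 0+16+1
```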
It should be noted that
As shown, each element of the destination register is generated by a circuit including three multipliers and an accumulator. In particular, the least significant element, R0, of DESTINATION 339 is generated by generating a first product by multiplying the least significant element A0 of SOURCE 1331 by itself, generating a second product by multiplying the least significant element B0 of SOURCE 2332 by itself, generating a third product by multiplying the least significant element C0 of SOURCE 3333 by itself, and accumulating the first, second, and third products with the previous value of R0. The second least significant element, R1, of DESTINATION 349 is generated by generating a first product by multiplying the second least significant element A1 of SOURCE 1341 by itself, generating a second product by multiplying the second least significant element B1 of SOURCE 2342 by itself, generating a third product by multiplying the second least significant element C1 of SOURCE 3343 by itself, and accumulating the first, second, and third products with the previous value of R1. The most significant element, RN, of DESTINATION 359 is generated by generating a first product by multiplying the most significant element AN of SOURCE 1351 by itself, generating a second product by multiplying the most significant element BN of SOURCE 2352 by itself, generating a third product by multiplying the most significant element CN of SOURCE 3353 by itself, and accumulating the first, second, and third products with the previous value of RN.
Each remaining element of the destination register is generated in a similar fashion. In some embodiments, the execution circuit generates each element of the destination register in parallel. In some embodiments, about half as much hardware is used by taking two cycles to update the destination register. In some embodiments, about a quarter as much hardware is used by taking four cycles to update the destination register. In some embodiments, each of the elements of the destination register is updated serially.
As shown, each element of the destination register is generated by a circuit including four multipliers and an accumulator. In particular, the least significant element, R0, of DESTINATION 369 is generated by generating a first product by multiplying the least significant element A0 of SOURCE 1361 by itself, generating a second product by multiplying the least significant element B0 of SOURCE 2362 by itself, generating a third product by multiplying the least significant element C0 of SOURCE 3363 by itself, generating a fourth product by multiplying the least significant element D0 of SOURCE 4364 by itself, and accumulating the first, second, third, and fourth products with the previous value of R0. The second least significant element, R1, of DESTINATION 379 is generated by generating a first product by multiplying the second least significant element A1 of SOURCE 1371 by itself, generating a second product by multiplying the second least significant element B1 of SOURCE 2372 by itself, generating a third product by multiplying the second least significant element C1 of SOURCE 3373 by itself, generating a fourth product by multiplying the second least significant element D1 of SOURCE 4374 by itself, and accumulating the first, second, third, and fourth products with the previous value of R1. The most significant element, RN, of DESTINATION 389 is generated by generating a first product by multiplying the most significant element AN of SOURCE 1381 by itself, generating a second product by multiplying the most significant element BN of SOURCE 2382 by itself, generating a third product by multiplying the most significant element CN of SOURCE 3383 by itself, generating a fourth product by multiplying the most significant element DN of SOURCE 4384 by itself, and accumulating the first, second, third, and fourth products with the previous value of RN.
Each remaining element of the destination register is generated in a similar fashion. In some embodiments, the execution circuit generates each element of the destination register in parallel. In some embodiments, about half as much hardware is used by taking two cycles to update the destination register. In some embodiments, about a quarter as much hardware is used by taking four cycles to update the destination register. In some embodiments, each of the elements of the destination register is updated serially.
As shown, each element of the destination register is generated by a circuit including two round FMA instances. In particular, the least significant element, R0, of DESTINATION 409 is generated by using round FMA 405 to generate a first product by multiplying the least significant element A0 of SOURCE 1401 by itself and accumulating the result with previous contents of destination 409, using round FMA 406 to generate a second product by multiplying the least significant element B0 of SOURCE 2402 by itself, and accumulating the result with the result of round FMA 405. The result is written to DESTINATION 409 and represents R0+A0*A0+B0*B0.
The next least significant element, R1, of DESTINATION 419 is generated by using round FMA 415 to generate a first product by multiplying the next least significant element A1 of SOURCE 1411 by itself and accumulating the result with previous contents of destination 419, using round FMA 416 to generate a second product by multiplying the next least significant element B1 of SOURCE 2412 by itself, and accumulating the result with the result of round FMA 415. The result is written to DESTINATION 419 and represents R1+A1*A1+B1*B1.
The most significant element, RN, of DESTINATION 429 is generated by using round FMA 425 to generate a first product by multiplying the most significant element AN of SOURCE 1421 by itself and accumulating the result with previous contents of destination 429, using round FMA 426 to generate a second product by multiplying the most significant element BN of SOURCE 2422 by itself, and accumulating the result with the result of round FMA 425. The result is written to DESTINATION 429 and represents RN+AN*AN+BN*BN.
Each remaining element of the destination register is generated in a similar fashion. In some embodiments, the execution circuit generates each element of the destination register in parallel. In some embodiments, about half as much hardware is used by taking two cycles to update the destination register. In some embodiments, about a quarter as much hardware is used by taking four cycles to update the destination register. In some embodiments, each of the elements of the destination register is updated serially.
As shown, each element of the destination register is generated by a circuit including three round FMA instances. In particular, the least significant element, R0, of DESTINATION 439 is generated by using round FMA 435 to generate a first product by multiplying the least significant element A0 of SOURCE 1431 by itself and accumulating the result with previous contents of destination 439, using round FMA 436 to generate a second product by multiplying the least significant element B0 of SOURCE 2432 by itself and accumulating the result with the result of round FMA 435, and using round FMA 437 to generate a third product by multiplying the least significant element C0 of SOURCE 3433 by itself and accumulating the result with the result of round FMA 436. The result is written to DESTINATION 439 and represents R0+A0*A0+B0*B0+C0*C0.
The next least significant element, R1, of DESTINATION 449 is generated by using round FMA 445 to generate a first product by multiplying the next least significant element A1 of SOURCE 1441 by itself and accumulating the result with previous contents of destination 449, using round FMA 446 to generate a second product by multiplying the next least significant element B1 of SOURCE 2442 by itself and accumulating the result with the result of round FMA 445, and using round FMA 447 to generate a third product by multiplying the next least significant element C1 of SOURCE 3443 by itself and accumulating the result with the result of round FMA 446. The result is written to DESTINATION 449 and represents R1+A1*A1+B1*B1+C1*C1.
The most significant element, RN, of DESTINATION 459 is generated by using round FMA 455 to generate a first product by multiplying the most significant element AN of SOURCE 1451 by itself and accumulating the result with previous contents of destination 459, using round FMA 456 to generate a second product by multiplying the most significant element BN of SOURCE 2452 by itself, and accumulating the result with the result of round FMA 455, and using round FMA 457 to generate a third product by multiplying the most significant element CN of SOURCE 3453 by itself, and accumulating the result with the result of round FMA 456. The result is written to DESTINATION 459 and represents RN+AN*AN+BN*BN+CN*CN.
Each remaining element of the destination register is generated in a similar fashion. In some embodiments, the execution circuit generates each element of the destination register in parallel. In some embodiments, about half as much hardware is used by taking two cycles to update the destination register. In some embodiments, about a quarter as much hardware is used by taking four cycles to update the destination register. In some embodiments, each of the elements of the destination register is updated serially.
As shown, each element of the destination register is generated by a circuit including four round FMA instances. In particular, the least significant element, R0, of DESTINATION 469 is generated by using round FMA 465 to generate a first product by multiplying the least significant element A0 of SOURCE 1461 by itself and accumulating the result with previous contents of destination 469, using round FMA 466 to generate a second product by multiplying the least significant element B0 of SOURCE 2462 by itself and accumulating the result with the result of round FMA 465, using round FMA 467 to generate a third product by multiplying the least significant element C0 of SOURCE 3463 by itself and accumulating the result with the result of round FMA 466, and using round FMA 468 to generate a fourth product by multiplying the least significant element D0 of SOURCE 4464 by itself and accumulating the result with the result of round FMA 467. The result is written to DESTINATION 469 and represents R0+A0*A0+B0*B0+C0*C0+D0*D0.
The next least significant element, R1, of DESTINATION 479 is generated by using round FMA 475 to generate a first product by multiplying the next least significant element A1 of SOURCE 1471 by itself and accumulating the result with previous contents of destination 479, using round FMA 476 to generate a second product by multiplying the next least significant element B1 of SOURCE 2472 by itself and accumulating the result with the result of round FMA 475, using round FMA 477 to generate a third product by multiplying the next least significant element C1 of SOURCE 3473 by itself and accumulating the result with the result of round FMA 476, and using round FMA 478 to generate a fourth product by multiplying the next least significant element D1 of SOURCE 4474 by itself and accumulating the result with the result of round FMA 477. The result is written to DESTINATION 479 and represents R1+A1*A1+B1*B1+C1*C1+D1*D1.
The most significant element, RN, of DESTINATION 489 is generated by using round FMA 485 to generate a first product by multiplying the most significant element AN of SOURCE 1481 by itself and accumulating the result with previous contents of destination 489, using round FMA 486 to generate a second product by multiplying the most significant element BN of SOURCE 2482 by itself, and accumulating the result with the result of round FMA 485, using round FMA 487 to generate a third product by multiplying the most significant element CN of SOURCE 3483 by itself, and accumulating the result with the result of round FMA 486, and using round FMA 488 to generate a fourth product by multiplying the most significant element DN of SOURCE 4484 by itself, and accumulating the result with the result of round FMA 487. The result is written to DESTINATION 489 and represents RN+AN*AN+BN*BN+CN*CN+DN*DN.
Each remaining element of the destination register is generated in a similar fashion. In some embodiments, the execution circuit generates each element of the destination register in parallel. In some embodiments, about half as much hardware is used by taking two cycles to update the destination register. In some embodiments, about a quarter as much hardware is used by taking four cycles to update the destination register. In some embodiments, each of the elements of the destination register is updated serially.
Opcode 701 is illustrated with an exemplary opcode, VPYTH*, which includes an asterisk. The asterisk signifies that the opcode can include various prefixes or suffixes to specify the instruction behavior. For example, a “2,” “3,” or “4” can be included as a prefix to the opcode, the prefix taking the place of order operand 703.
Destination identifier DEST 702, and source identifiers SRC 1704, SRC 2705, SRC 3706, and SRC 4707 can specify a packed data vector register, or a memory location containing a packed data vector. The third and fourth sources, SRC 3706 and SRC 4707, are optional insofar as they are only included when the order 703 is three or four. Precision 708 specifies the size of each of the vector elements to be processed, and is one of SINGLE (32-bit single-precision floating point) and DOUBLE (64-bit double-precision floating point); in some embodiments, precision 708 can also specify 8-bit and 16-bit vector elements. Precision 708 is optional, as indicated by its dashed border, insofar as a default precision is used if the instruction lacks a precision operand. Writemask 709 is a multi-bit value, with each bit controlling whether execution results will be written to a corresponding element of the destination. Register size 710 specifies the width of the source and destination registers, and specifies one of 128 bits, 256 bits, and 512 bits. The register size, divided by the size of the elements, indicates how many elements will be processed.
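The operand fields just described can be modeled as a simple record. The sketch below mirrors the field names from the text; the representation itself is illustrative and is not an actual machine encoding:

```python
# Illustrative model of the VPYTH* instruction fields (opcode order,
# destination 702, sources 704-707, precision 708, writemask 709,
# register size 710). Not a real machine encoding.
from dataclasses import dataclass
from typing import List

@dataclass
class VpythInstruction:
    order: int                 # 2, 3, or 4 (order operand 703)
    dest: str                  # destination identifier 702
    sources: List[str]         # SRC1..SRC4; SRC3/SRC4 only for order 3/4
    precision: str = "SINGLE"  # default used when operand 708 is absent
    writemask: int = ~0        # one bit per destination element (709)
    register_size: int = 512   # 128, 256, or 512 bits (710)

    def element_count(self):
        # Register size divided by element size gives the lane count.
        bits = {"SINGLE": 32, "DOUBLE": 64}[self.precision]
        return self.register_size // bits

insn = VpythInstruction(order=2, dest="zmm0", sources=["zmm1", "zmm2"])
insn.element_count()  # 512 / 32 = 16 elements
```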
The format of the VPYTH* instruction according to disclosed embodiments is further described below, and with reference to
Detailed below are further embodiments of an instruction format for the above described instructions and architectures (e.g., pipelines, cores, etc.) and systems that support these instructions and the embodiments detailed above.
An instruction set includes one or more instruction formats. A given instruction format defines various fields (number of bits, location of bits) to specify, among other things, the operation to be performed (opcode) and the operand(s) on which that operation is to be performed. Some instruction formats are further broken down through the definition of instruction templates (or subformats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the instruction format's fields (the included fields are typically in the same order, but at least some have different bit positions because there are fewer fields included) and/or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, in a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select operands (source1/destination and source2); and an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands.
Embodiments of the instruction(s) described herein may be embodied in different formats. Additionally, exemplary systems, architectures, and pipelines are detailed below. Embodiments of the instruction(s) may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.
VEX encoding allows instructions to have more than two operands, and allows SIMD vector registers to be longer than 128 bits. The use of a VEX prefix provides for three-operand (or more) syntax. For example, previous two-operand instructions performed operations such as A=A+B, which overwrites a source operand. The use of a VEX prefix enables nondestructive operations such as A=B+C.
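The difference between the destructive two-operand form and the nondestructive VEX three-operand form can be illustrated with a toy register model; the register names and values here are purely illustrative.

```python
# Hypothetical register file; values are illustrative.
regs = {"xmm0": 0.0, "xmm1": 2.0, "xmm2": 3.0}

# VEX form: VADDPS xmm0, xmm1, xmm2  ->  xmm0 = xmm1 + xmm2
# Both source registers are preserved.
regs["xmm0"] = regs["xmm1"] + regs["xmm2"]
assert regs["xmm1"] == 2.0  # first source survives

# Legacy two-operand form: ADDPS xmm1, xmm2  ->  xmm1 = xmm1 + xmm2
# The first source operand is overwritten (destructive).
regs["xmm1"] = regs["xmm1"] + regs["xmm2"]
```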
VEX Prefix (Bytes 0-2) 712 is encoded in a three-byte form. The first byte is the Format Field 790 (VEX Byte 0, bits [7:0]), which contains an explicit C4 byte value (the unique value used for distinguishing the C4 instruction format). The second-third bytes (VEX Bytes 1-2) include a number of bit fields providing specific capability. Specifically, REX field 711 (VEX Byte 1, bits [7-5]) consists of a VEX.R bit field (VEX Byte 1, bit [7]-R), VEX.X bit field (VEX byte 1, bit [6]-X), and VEX.B bit field (VEX byte 1, bit [5]-B). Other fields of the instructions encode the lower three bits of the register indexes as is known in the art (rrr, xxx, and bbb), so that Rrrr, Xxxx, and Bbbb may be formed by adding VEX.R, VEX.X, and VEX.B. Opcode map field 715 (VEX byte 1, bits [4:0]-mmmmm) includes content to encode an implied leading opcode byte. W Field 764 (VEX byte 2, bit [7]-W)—is represented by the notation VEX.W, and provides different functions depending on the instruction. The role of VEX.vvvv 720 (VEX Byte 2, bits [6:3]-vvvv) may include the following: 1) VEX.vvvv encodes the first source register operand, specified in inverted (1s complement) form and is valid for instructions with 2 or more source operands; 2) VEX.vvvv encodes the destination register operand, specified in 1s complement form for certain vector shifts; or 3) VEX.vvvv does not encode any operand, the field is reserved and should contain 1111b. If VEX.L 768 Size field (VEX byte 2, bit [2]-L) =0, it indicates 128 bit vector; if VEX.L =1, it indicates 256 bit vector. Prefix encoding field 725 (VEX byte 2, bits [1:0]-pp) provides additional bits for the base operation field 741.
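A software sketch of extracting the three-byte VEX fields just described may clarify the bit layout; the helper name and returned dictionary are illustrative, and the field positions follow the description above.

```python
def decode_vex3(prefix: bytes) -> dict:
    # Decode a three-byte (C4) VEX prefix into its bit fields.
    assert prefix[0] == 0xC4            # format byte distinguishing the C4 form
    b1, b2 = prefix[1], prefix[2]
    return {
        "R": (b1 >> 7) & 1,             # VEX.R (Byte 1, bit 7)
        "X": (b1 >> 6) & 1,             # VEX.X (Byte 1, bit 6)
        "B": (b1 >> 5) & 1,             # VEX.B (Byte 1, bit 5)
        "mmmmm": b1 & 0x1F,             # opcode map: implied leading opcode byte
        "W": (b2 >> 7) & 1,             # VEX.W (Byte 2, bit 7)
        "vvvv": (~(b2 >> 3)) & 0xF,     # extra register, stored 1s-complemented
        "L": (b2 >> 2) & 1,             # 0 = 128-bit vector, 1 = 256-bit vector
        "pp": b2 & 0x3,                 # additional bits for the base operation
    }
```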
Real Opcode Field 730 (Byte 3) is also known as the opcode byte. Part of the opcode is specified in this field.
MOD R/M Field 740 (Byte 4) includes MOD field 742 (bits [7-6]), Reg field 744 (bits [5-3]), and R/M field 746 (bits [2-0]). The role of Reg field 744 may include the following: encoding either the destination register operand or a source register operand (the rrr of Rrrr), or be treated as an opcode extension and not used to encode any instruction operand. The role of R/M field 746 may include the following: encoding the instruction operand that references a memory address, or encoding either the destination register operand or a source register operand.
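The three ModR/M bit fields can be extracted as follows; this is a minimal sketch of the bit positions given above.

```python
def decode_modrm(byte: int):
    # Split a ModR/M byte into its three fields.
    mod = (byte >> 6) & 0x3   # addressing mode (bits 7-6)
    reg = (byte >> 3) & 0x7   # register operand or opcode extension, rrr (bits 5-3)
    rm = byte & 0x7           # memory-referencing or register operand, bbb (bits 2-0)
    return mod, reg, rm
```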
Scale, Index, Base (SIB)—The content of Scale field 750 (Byte 5) includes SS 752 (bits [7-6]), which is used for memory address generation. The contents of SIB.xxx 754 (bits [5-3]) and SIB.bbb 756 (bits [2-0]) have been previously referred to with regard to the register indexes Xxxx and Bbbb.
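A sketch of the SIB fields and the scaled-index address computation they feed; the effective-address helper and its displacement parameter are illustrative additions, consistent with the conventional base + (index << scale) + displacement formula.

```python
def decode_sib(byte: int):
    # Split a SIB byte into scale, index, and base fields.
    ss = (byte >> 6) & 0x3    # scale: the index is multiplied by 2**ss
    idx = (byte >> 3) & 0x7   # xxx portion of the index register number
    base = byte & 0x7         # bbb portion of the base register number
    return ss, idx, base

def effective_address(base_val: int, index_val: int, ss: int, disp: int = 0) -> int:
    # Conventional SIB address generation: base + index * 2**scale + displacement.
    return base_val + (index_val << ss) + disp
```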
The Displacement Field 762 and the immediate field (IMM8) 772 contain data.
General-purpose registers 825—in the embodiment illustrated, there are sixteen 64-bit general-purpose registers that are used along with the existing x86 addressing modes to address memory operands. These registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.
Scalar floating point stack register file (x87 stack) 845, on which is aliased the MMX packed integer flat register file 850—in the embodiment illustrated, the x87 stack is an eight-element stack used to perform scalar floating-point operations on 32/64/80-bit floating point data using the x87 instruction set extension; while the MMX registers are used to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.
Alternative embodiments of the invention may use wider or narrower registers. Additionally, alternative embodiments of the invention may use more, less, or different register files and registers.
Exemplary Core Architectures, Processors, and Computer Architectures
Processor cores may be implemented in different ways, for different purposes, and in different processors. For instance, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput). Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case, such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures. Detailed herein are circuits (units) that comprise exemplary cores, processors, etc.
In-Order and Out-of-Order Core Block Diagram
In
The front end unit 930 includes a branch prediction unit 932 coupled to an instruction cache unit 934, which is coupled to an instruction translation lookaside buffer (TLB) 936, which is coupled to an instruction fetch unit 938, which is coupled to a decode unit 940. The decode unit 940 (or decoder) may decode instructions, and generate as an output one or more micro-operations, micro-code entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 940 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read only memories (ROMs), etc. In one embodiment, the core 990 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 940 or otherwise within the front end unit 930). The decode unit 940 is coupled to a rename/allocator unit 952 in the execution engine unit 950.
The execution engine unit 950 includes the rename/allocator unit 952 coupled to a retirement unit 954 and a set of one or more scheduler unit(s) 956. The scheduler unit(s) 956 represents any number of different schedulers, including reservations stations, central instruction window, etc. The scheduler unit(s) 956 is coupled to the physical register file(s) unit(s) 958. Each of the physical register file(s) units 958 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 958 comprises a vector registers unit and a scalar registers unit. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. The physical register file(s) unit(s) 958 is overlapped by the retirement unit 954 to illustrate various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer(s) and a retirement register file(s); using a future file(s), a history buffer(s), and a retirement register file(s); using a register maps and a pool of registers; etc.). The retirement unit 954 and the physical register file(s) unit(s) 958 are coupled to the execution cluster(s) 960. The execution cluster(s) 960 includes a set of one or more execution units 962 and a set of one or more memory access units 964. The execution units 962 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point).
While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 956, physical register file(s) unit(s) 958, and execution cluster(s) 960 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline that each have their own scheduler unit, physical register file(s) unit, and/or execution cluster—and in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access unit(s) 964). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
The set of memory access units 964 is coupled to the memory unit 970, which includes a data TLB unit 972 coupled to a data cache unit 974 coupled to a level 2 (L2) cache unit 976. In one exemplary embodiment, the memory access units 964 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 972 in the memory unit 970. The instruction cache unit 934 is further coupled to a level 2 (L2) cache unit 976 in the memory unit 970. The L2 cache unit 976 is coupled to one or more other levels of cache and eventually to a main memory.
By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 900 as follows: 1) the instruction fetch 938 performs the fetch and length decoding stages 902 and 904; 2) the decode unit 940 performs the decode stage 906; 3) the rename/allocator unit 952 performs the allocation stage 908 and renaming stage 910; 4) the scheduler unit(s) 956 performs the schedule stage 912; 5) the physical register file(s) unit(s) 958 and the memory unit 970 perform the register read/memory read stage 914; 6) the execution cluster 960 performs the execute stage 916; 7) the memory unit 970 and the physical register file(s) unit(s) 958 perform the write back/memory write stage 918; 8) various units may be involved in the exception handling stage 922; and 9) the retirement unit 954 and the physical register file(s) unit(s) 958 perform the commit stage 924.
The core 990 may support one or more instructions sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.), including the instruction(s) described herein. In one embodiment, the core 990 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (e.g., time sliced fetching and decoding and simultaneous multithreading thereafter such as in the Intel® Hyperthreading technology).
While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. While the illustrated embodiment of the processor also includes separate instruction and data cache units 934/974 and a shared L2 cache unit 976, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a Level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.
Specific Exemplary In-Order Core Architecture
The local subset of the L2 cache 1004 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 1004. Data read by a processor core is stored in its L2 cache subset 1004 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 1004 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional to allow agents such as processor cores, L2 caches and other logic blocks to communicate with each other within the chip. Each ring data-path is 1024-bits wide per direction in some embodiments.
Processor with Integrated Memory Controller and Graphics
Thus, different implementations of the processor 1100 may include: 1) a CPU with the special purpose logic 1108 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1102A-N being one or more general purpose cores (e.g., general purpose in-order cores, general purpose out-of-order cores, a combination of the two); 2) a coprocessor with the cores 1102A-N being a large number of special purpose cores intended primarily for graphics and/or scientific (throughput); and 3) a coprocessor with the cores 1102A-N being a large number of general purpose in-order cores. Thus, the processor 1100 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1100 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores 1102A-N, a set of one or more shared cache units 1106, and external memory (not shown) coupled to the set of integrated memory controller units 1114. The set of shared cache units 1106 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring based interconnect unit 1112 interconnects the integrated graphics logic 1108, the set of shared cache units 1106, and the system agent unit 1110/integrated memory controller unit(s) 1114, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between one or more cache units 1106 and cores 1102A-N.
In some embodiments, one or more of the cores 1102A-N are capable of multi-threading. The system agent 1110 includes those components coordinating and operating cores 1102A-N. The system agent unit 1110 may include for example a power control unit (PCU) and a display unit. The PCU may be or include logic and components needed for regulating the power state of the cores 1102A-N and the integrated graphics logic 1108. The display unit is for driving one or more externally connected displays.
The cores 1102A-N may be homogenous or heterogeneous in terms of architecture instruction set; that is, two or more of the cores 1102A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.
Exemplary Computer Architectures
Referring now to
The optional nature of additional processors 1215 is denoted in
The memory 1240 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1220 communicates with the processor(s) 1210, 1215 via a multi-drop bus, such as a frontside bus (FSB), point-to-point interface, or similar connection 1295.
In one embodiment, the coprocessor 1245 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 1220 may include an integrated graphics accelerator.
There can be a variety of differences between the physical resources 1210, 1215 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like.
In one embodiment, the processor 1210 executes instructions that control data processing operations of a general type. Embedded within the instructions may be coprocessor instructions. The processor 1210 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1245. Accordingly, the processor 1210 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect, to coprocessor 1245. Coprocessor(s) 1245 accept and execute the received coprocessor instructions.
Referring now to
Processors 1370 and 1380 are shown including integrated memory controller (IMC) units 1372 and 1382, respectively. Processor 1370 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 1376 and 1378; similarly, second processor 1380 includes P-P interfaces 1386 and 1388. Processors 1370, 1380 may exchange information via a point-to-point (P-P) interface 1350 using P-P interface circuits 1378, 1388. As shown in
Processors 1370, 1380 may each exchange information with a chipset 1390 via individual P-P interfaces 1352, 1354 using point to point interface circuits 1376, 1394, 1386, 1398. Chipset 1390 may optionally exchange information with the coprocessor 1338 via a high-performance interface 1392. In one embodiment, the coprocessor 1338 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like.
A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
Chipset 1390 may be coupled to a first bus 1316 via an interface 1396. In one embodiment, first bus 1316 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another I/O interconnect bus, although the scope of the present invention is not so limited.
As shown in
Referring now to
Referring now to
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as code 1330 illustrated in
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs), static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), which defines structures, circuits, apparatuses, processors and/or system features described herein. Such embodiments may also be referred to as program products.
Emulation (Including Binary Translation, Code Morphing, Etc.)
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction to one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on processor, off processor, or part on and part off processor.
Example 1 provides a processor including: fetch circuitry to fetch an instruction having an opcode, an order, a destination identifier, and N source identifiers, N being equal to the order, and the order being one of two, three, and four, decode circuitry to decode the fetched instruction, execution circuitry, for each element of the identified destination, to: generate N squares by squaring each corresponding element of the N identified sources, and generate a sum of the N squares and previous contents of the element.
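The per-element semantics of Example 1 can be modeled in software as follows. This is a behavioral sketch, not a description of the circuitry; the function name and the use of Python lists for vector registers are illustrative.

```python
def vpyth_reference(dest, sources):
    # For each destination element, sum the squares of the corresponding
    # source elements and accumulate into the element's previous contents.
    order = len(sources)        # N == order, one of 2, 3, 4
    assert order in (2, 3, 4)
    for i in range(len(dest)):
        squares = [src[i] * src[i] for src in sources]
        dest[i] = dest[i] + sum(squares)
    return dest

# 2nd-order example with one Pythagorean triple per lane: 3-4-5 and 5-12-13.
dest = [0.0, 0.0]
vpyth_reference(dest, [[3.0, 5.0], [4.0, 12.0]])
print(dest)  # [25.0, 169.0]
```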
Example 2 includes the substance of the exemplary processor of Example 1, wherein the execution circuit uses a chain of N two-way fused multiply adders to generate the N squares and the sum.
Example 3 includes the substance of the exemplary processor of Example 1, wherein the execution circuit uses N two-input multipliers to generate the N squares in parallel, and uses an N-plus-one-input adder to generate the sum.
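The two micro-architectural options of Examples 2 and 3 compute the same per-element result; the following sketch models both for a single destination element (function names are illustrative).

```python
def vpyth_fma_chain(acc, elems):
    # Example 2: a chain of N two-way fused multiply-adds; each stage
    # computes acc = e*e + acc, carrying the running sum forward.
    for e in elems:
        acc = e * e + acc
    return acc

def vpyth_parallel(acc, elems):
    # Example 3: N two-input multipliers square the elements in parallel,
    # then an (N+1)-input adder folds the squares in with the previous
    # contents of the destination element.
    squares = [e * e for e in elems]
    return sum(squares) + acc
```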
Example 4 includes the substance of the exemplary processor of Example 1, wherein the order is specified by one of the opcode, an opcode prefix, an opcode suffix, and an immediate.
Example 5 includes the substance of the exemplary processor of Example 1, wherein each element of the identified destination and the N identified sources includes a fixed size, the instruction further including a precision operand to specify the fixed size.
Example 6 includes the substance of the exemplary processor of Example 1, wherein each element of the identified destination and the N identified sources includes a floating point value.
Example 7 includes the substance of the exemplary processor of Example 1, wherein the instruction further includes a writemask, the writemask being a multi-bit value with each bit to control, for each element of the identified destination, whether the sum is stored to the element.
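The writemask behavior of Example 7 can be sketched as follows, assuming merging semantics (unwritten elements keep their previous contents; a zeroing variant would clear them instead — that choice is an assumption, not stated in the example).

```python
def apply_writemask(dest, results, mask):
    # Bit i of the writemask controls whether results[i] is stored to
    # destination element i; masked-off elements retain their old value.
    return [results[i] if (mask >> i) & 1 else dest[i]
            for i in range(len(dest))]
```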
Example 8 includes the substance of the exemplary processor of Example 1, wherein the destination identifier and the N source identifiers each specifies a vector register having a vector length, wherein the vector length includes one of 128 bits, 256 bits, and 512 bits, and wherein the instruction further specifies the vector length using one of the opcode, a prefix to the opcode, and an immediate.
Example 9 includes the substance of the exemplary processor of Example 1, wherein the identified destination is zeroed after reset.
Example 10 includes the substance of the exemplary processor of Example 1, wherein the execution circuit is to execute the decoded instruction over multiple cycles, processing a subset of the elements of the identified destination on each cycle.
Example 11 provides a method including: fetching, using fetch circuitry, an instruction having an opcode, an order, a destination identifier, and N source identifiers, N being equal to the order, and the order being one of two, three, and four, decoding, using decode circuitry, the fetched instruction, executing, by execution circuitry, to, for each element of the identified destination: generate N squares by squaring each corresponding element of the N identified sources, and generate a sum of the N squares and previous contents of the element.
Example 12 includes the substance of the exemplary method of Example 11, further including using, by the execution circuit, a chain of N two-way fused multiply adders to generate the N squares and the sum.
Example 13 includes the substance of the exemplary method of Example 11, further including using, by the execution circuit, N two-input multipliers to generate the N squares in parallel, and an N-plus-one-input adder to generate the sum.
Example 14 includes the substance of the exemplary method of Example 11, wherein the order is specified by one of the opcode, an opcode prefix, an opcode suffix, and an immediate.
Example 15 includes the substance of the exemplary method of Example 11, wherein each element of the identified destination and the N identified sources includes a fixed size, the instruction further including a precision operand to specify the fixed size.
Example 16 includes the substance of the exemplary method of Example 11, wherein each element of the identified destination and the N identified sources includes a floating point value.
Example 17 includes the substance of the exemplary method of Example 11, wherein the instruction further includes a writemask, the writemask being a multi-bit value with each bit to control, for each element of the identified destination, whether the sum is stored to the element.
Example 18 includes the substance of the exemplary method of Example 11, wherein the destination identifier and the N source identifiers each specifies a vector register having a vector length, wherein the vector length includes one of 128 bits, 256 bits, and 512 bits, and wherein the instruction further specifies the vector length using one of the opcode, a prefix to the opcode, and an immediate.
Example 19 includes the substance of the exemplary method of Example 11, wherein the identified destination is zeroed after reset.
Example 20 includes the substance of the exemplary method of Example 11, wherein the execution circuit is to execute the decoded instruction over multiple cycles, processing a subset of the elements of the identified destination on each cycle.
Example 21 provides an apparatus including: means for fetching an instruction having an opcode, an order, a destination identifier, and N source identifiers, N being equal to the order, and the order being one of two, three, and four, means for decoding the fetched instruction, means for executing to, for each element of the identified destination: generate N squares by squaring each corresponding element of the N identified sources, and generate a sum of the N squares and previous contents of the element.
Example 22 includes the substance of the exemplary apparatus of Example 21, wherein the means for executing uses a chain of N two-way fused multiply adders to generate the N squares and the sum.
Example 23 includes the substance of the exemplary apparatus of Example 21, wherein the means for executing uses N two-input multipliers to generate the N squares in parallel, and uses an N-plus-one-input adder to generate the sum.
Example 24 includes the substance of the exemplary apparatus of Example 21, wherein the order is specified by one of the opcode, an opcode prefix, an opcode suffix, and an immediate.
Example 25 includes the substance of the exemplary apparatus of Example 21, wherein each element of the identified destination and the N identified sources includes a fixed size, the instruction further including a precision operand to specify the fixed size.
Example 26 includes the substance of the exemplary apparatus of Example 21, wherein each element of the identified destination and the N identified sources includes a floating point value.
Example 27 includes the substance of the exemplary apparatus of Example 21, wherein the instruction further includes a writemask, the writemask being a multi-bit value with each bit to control, for each element of the identified destination, whether the sum is stored to the element.
Example 28 includes the substance of the exemplary apparatus of Example 21, wherein the destination identifier and the N source identifiers each specifies a vector register having a vector length, wherein the vector length includes one of 128 bits, 256 bits, and 512 bits, and wherein the instruction further specifies the vector length using one of the opcode, a prefix to the opcode, and an immediate.
Example 29 includes the substance of the exemplary apparatus of Example 21, wherein the identified destination is zeroed after reset.
Example 30 includes the substance of the exemplary apparatus of Example 21, wherein the execution circuit is to execute the decoded instruction over multiple cycles, processing a subset of the elements of the identified destination on each cycle.
Example 31 provides a non-transitory computer-readable medium containing instructions that, when executed by a processor, cause the processor to: fetch, using fetch circuitry, an instruction having an opcode, an order, a destination identifier, and N source identifiers, N being equal to the order, and the order being one of two, three, and four, decode, using decode circuitry, the fetched instruction, execute, by execution circuitry, to, for each element of the identified destination: generate N squares by squaring each corresponding element of the N identified sources, and generate a sum of the N squares and previous contents of the element.
Example 32 includes the substance of the exemplary non-transitory computer-readable medium of Example 31, further including using, by the execution circuitry, a chain of N two-way fused multiply adders to generate the N squares and the sum.
Example 33 includes the substance of the exemplary non-transitory computer-readable medium of Example 31, further including using, by the execution circuitry, N two-input multipliers to generate the N squares in parallel, and an N-plus-one-input adder to generate the sum.
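Examples 32 and 33 describe two alternative datapaths for the same per-element computation: a serial chain of fused multiply-adders versus parallel multipliers feeding a wide adder. A hedged sketch of both, modeled per element (real floating-point hardware may round the two variants differently; that detail is ignored here):

```python
def fma(a, b, c):
    # Model of a two-way fused multiply-adder: a*b + c.
    # (A hardware FMA rounds once; this model is exact for the
    # small integral values used below.)
    return a * b + c

def chain_fma(dst_elem, src_elems):
    # Example 32 style: a chain of N fused multiply-adders, each
    # squaring one source element and accumulating into the running
    # sum, seeded with the destination's previous contents.
    acc = dst_elem
    for s in src_elems:
        acc = fma(s, s, acc)
    return acc

def parallel_mul_add(dst_elem, src_elems):
    # Example 33 style: N two-input multipliers produce the squares
    # in parallel; an (N+1)-input adder then sums the N squares with
    # the previous destination contents.
    squares = [s * s for s in src_elems]
    return sum(squares) + dst_elem
```

The chained form needs fewer adders but serializes the N additions; the parallel form trades multiplier area for lower latency, which is the usual hardware design trade-off between the two.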
Example 34 includes the substance of the exemplary non-transitory computer-readable medium of Example 31, wherein the order is specified by one of the opcode, an opcode prefix, an opcode suffix, and an immediate.
Example 35 includes the substance of the exemplary non-transitory computer-readable medium of Example 31, wherein each element of the identified destination and the N identified sources includes a fixed size, the instruction further including a precision operand to specify the fixed size.
Example 36 includes the substance of the exemplary non-transitory computer-readable medium of Example 31, wherein each element of the identified destination and the N identified sources includes a floating point value.
Example 37 includes the substance of the exemplary non-transitory computer-readable medium of Example 31, wherein the instruction further includes a writemask, the writemask being a multi-bit value with each bit to control, for each element of the identified destination, whether the sum is stored to the element.
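The writemask of Example 37 gates, per destination element, whether the computed sum is stored. A hedged model of merge-masking behavior (the claim does not say whether masked-off elements are merged or zeroed; this sketch assumes merging, i.e., masked-off elements retain their previous contents):

```python
def ptuple_masked(dst, srcs, mask):
    # Each bit of the multi-bit writemask controls, for the
    # corresponding destination element, whether the sum of the N
    # squares plus the previous contents is stored to that element.
    out = list(dst)
    for i in range(len(dst)):
        if (mask >> i) & 1:
            out[i] = dst[i] + sum(s[i] ** 2 for s in srcs)
    return out

# mask 0b01: only lane 0 is updated; lane 1 keeps its old value.
result = ptuple_masked([1.0, 1.0], [[3.0, 3.0], [4.0, 4.0]], 0b01)
# result is [26.0, 1.0]
```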
Example 38 includes the substance of the exemplary non-transitory computer-readable medium of Example 31, wherein the destination identifier and the N source identifiers each specifies a vector register having a vector length, wherein the vector length includes one of 128 bits, 256 bits, and 512 bits, and wherein the instruction further specifies the vector length using one of the opcode, a prefix to the opcode, and an immediate.
Example 39 includes the substance of the exemplary non-transitory computer-readable medium of Example 31, wherein the identified destination is zeroed after reset.
Example 40 includes the substance of the exemplary non-transitory computer-readable medium of Example 31, wherein the execution circuitry is to execute the decoded instruction over multiple cycles, processing a subset of the elements of the identified destination on each cycle.