There are many inventions described and illustrated herein. The present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Importantly, each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof. All combinations and permutations thereof are intended to fall within the scope of the present inventions.
In one aspect, the present inventions are directed to an integrated circuit having (i) a multiplier circuit array and/or (ii) one or more multiplier-accumulator circuits, wherein each multiplier-accumulator circuit includes a distinct or separate multiplier circuit array to implement or perform multiply operations (e.g., multiply input data and filter weights having a floating point data format). The multiplier circuit array includes a plurality of interconnected multiplier circuits. The multiplier circuits, in one embodiment, are disposed adjacent to each other and are interconnected, for example, via a dedicated multi-drop or point-to-point bus. In another embodiment, the multiplier circuit array includes a first multiplier circuit having circuitry including a second multiplier circuit incorporated or embedded therein (e.g., the multiply core of the second multiplier circuit is incorporated or embedded into the circuitry of the first multiplier circuit).
In operation, the plurality of interconnected multiplier circuits of the multiplier circuit array perform the multiply operation of the multiplier-accumulator circuit. For example, in one embodiment, a first multiplier circuit of the multiplier circuit array performs a first portion of the multiply operation (e.g., in the context of a floating point data format, values of the sign bit fields and the exponent fields of, for example, the input data and filter weights) and one or more other multiplier circuits of the multiplier circuit array process or perform one or more other portions of the multiply operation (e.g., in the context of a floating point data format, values of fraction fields of, for example, the input data and filter weights—via, for example, two's complement multiplication). Thereafter, the “product” or output of each multiplier circuit may be “combined” or “joined” into data having a particular data format. For example, the output of the first multiplier circuit corresponding to the first portion of the multiply operation (e.g., values of the sign bit fields and the exponent fields of two operands) may be combined or joined with the output of a second multiplier circuit corresponding to a second portion of the multiply operation (e.g., values of fraction fields of the two operands—via, for example, two's complement multiplication) to form or construct a composite product/output having sign, exponent, and fraction fields.
Thus, in one embodiment, the operands are deconstructed into predetermined fields (e.g., sign, exponent and fractions), wherein the related fields of the operands are multiplied by one of the plurality of multiplier circuits. Thereafter, the “product” or output of each multiplier circuit may be “combined” or “joined”—wherein the data from the multiplier circuits are “reconstructed” into product data having a predetermined data format (e.g., programmable format—one-time or more than one-time). The predetermined format of the product data (e.g., (i) floating point or integer type and/or (ii) bit lengths of the fields) may or may not be the same as one or both of the operands.
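The deconstruct, multiply and reconstruct flow described above can be sketched in software. The following Python sketch is illustrative only and is not the circuitry of any embodiment. It assumes a 32 bit floating point format (a sign bit, an eight bit exponent field and a 23 bit fraction field) with an exponent bias of 127, an implicit leading one on each fraction, normal (non-special) operands, and truncation in place of rounding. All function names are hypothetical.

```python
import struct

def unpack_fp32(x):
    """Deconstruct a value into (sign, biased exponent, 23-bit fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def sign_exponent_portion(sa, ea, sb, eb):
    """Hypothetical stand-in for the first multiplier circuit:
    XOR the sign bits and add the biased exponents (assumed bias of 127)."""
    return sa ^ sb, ea + eb - 127

def fraction_portion(fa, fb):
    """Hypothetical stand-in for the second multiplier circuit:
    integer multiply of the two 24-bit significands (implicit one restored)."""
    return ((1 << 23) | fa) * ((1 << 23) | fb)  # up to 48 bits

def combine(sign, exp, prod):
    """Join the two partial results into one 32-bit floating point value.
    Assumption: truncation, no rounding and no special-value handling."""
    if prod >> 47:          # significand product in [2, 4): renormalize
        prod >>= 1
        exp += 1
    frac = (prod >> 23) & 0x7FFFFF
    bits = (sign << 31) | (exp << 23) | frac
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp32_multiply(a, b):
    sa, ea, fa = unpack_fp32(a)
    sb, eb, fb = unpack_fp32(b)
    sign, exp = sign_exponent_portion(sa, ea, sb, eb)
    return combine(sign, exp, fraction_portion(fa, fb))

print(fp32_multiply(1.5, -2.5))
```

The two portion functions share no state, which mirrors the idea that the sign/exponent fields and the fraction fields may be processed by different multiplier circuits before the outputs are joined.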
The plurality of interconnected multiplier circuits of the multiplier circuit array, in one embodiment, may perform the multiply operations based on different data formats (e.g., a first multiplier circuit may be a floating point type and a second multiplier circuit may be an integer type). Moreover, the plurality of interconnected multiplier circuits, in one embodiment, may include the same or different multiplication precisions (e.g., a first multiplier circuit may be an x-bit floating point type (e.g., a 32 bit floating point type multiplier circuit) and a second multiplier circuit may be a y-bit integer type (wherein y may or may not equal x; e.g., a 32 bit integer type multiplier circuit, a 24 bit integer type multiplier circuit, or a 16 bit integer type multiplier circuit)). Indeed, in one embodiment, the multiplier circuit array may include three or more multiplier circuits wherein each multiplier circuit includes a different multiplication precision (e.g., a first multiplier circuit may be an x-bit floating point type (e.g., a 32 bit floating point type multiplier circuit), a second multiplier circuit may be a y-bit integer type (e.g., a 16 or 24 bit integer type multiplier circuit), and a third multiplier circuit may be a z-bit integer type (e.g., an 8 bit integer type multiplier circuit)).
Notably, each multiplier circuit of the multiplier circuit array may be a complete and fully functional/capable multiplier circuit or may be a partial multiplier circuit including only certain or selected circuitry of a complete and fully functional/capable multiplier circuit (e.g., omission of: (i) circuitry to perform the multiply operation corresponding to sign fields of the operands, and/or (ii) circuitry to perform the multiply operation corresponding to the exponent fields of the operands and/or (iii) circuitry to perform the multiply operation corresponding to the fraction fields of the operands). For example, in one embodiment, a first multiplier circuit may be an x-bit floating point type (e.g., a 32 bit or 24 bit floating point type multiplier circuit) which processes or performs a first portion of the multiply operation (e.g., values of the sign bit fields and the exponent fields of, for example, the input data and filter weights) and a second multiplier circuit may be a y-bit integer type (e.g., a 32 bit, 24 bit or 16 bit integer type multiplier circuit) which processes or performs one or more other portions of the multiply operation (e.g., values of fraction fields of, for example, the input data and filter weights). In this exemplary embodiment, the first multiplier circuit may not include circuitry to perform the one or more portions of the multiply operation that is to be performed by the second multiplier circuit (e.g., circuitry associated with the multiply operation of the values of fraction fields of, for example, the input data and filter weights). Similarly, the second multiplier circuit may not include circuitry to perform the one or more portions of the multiply operation that is to be performed by the first multiplier circuit (e.g., circuitry associated with the multiply operation corresponding to the values of the sign bit fields and the exponent fields of, for example, the input data and filter weights).
The multiplier circuits of the multiplier circuit array may be interconnected via conductors (e.g., one or more buses (e.g., point-to-point and/or multi-drop)). For example, in one embodiment, at least one of the multiplier circuits of the multiplier circuit array outputs data of the product resulting from the multiply operation (e.g., the output of the multiply operation of the values of fraction fields of, for example, the input data and filter weights) to another multiplier circuit of the multiplier circuit array. Notably, in one embodiment, the conductors may also communicate control information (e.g., rounding information/data, outputs from fraction detection logic to detect, for example, special values/operands such as ZRO (zero), NAN (not a number), EOVFL (exponent overflow), EUNFL (exponent underflow) and/or INF (infinity)).
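By way of illustration only, the special value/operand indications named above (ZRO, NAN, INF, EOVFL and EUNFL) might be derived as in the following Python sketch. The encoding is an assumption: an exponent field of all ones marks infinity (zero fraction) or not-a-number (non-zero fraction), an exponent and fraction of zero mark zero, and denormal operands are not handled. The function name and the returned dictionary are hypothetical, and any exponent adjustment from later normalization is ignored.

```python
def special_flags(exp_a, frac_a, exp_b, frac_b, exp_bits=8):
    """Return hypothetical detection flags for the product of two operands,
    each given as (biased exponent field, fraction field)."""
    max_exp = (1 << exp_bits) - 1          # all-ones exponent field
    bias = max_exp >> 1                    # 127 for an 8-bit exponent
    operands = ((exp_a, frac_a), (exp_b, frac_b))

    zero = [e == 0 and f == 0 for e, f in operands]
    inf = [e == max_exp and f == 0 for e, f in operands]
    nan = [e == max_exp and f != 0 for e, f in operands]

    # zero times infinity is treated as not-a-number (assumed convention)
    is_nan = any(nan) or (any(zero) and any(inf))
    is_inf = any(inf) and not is_nan
    is_zero = any(zero) and not is_nan
    special = is_nan or is_inf or is_zero

    # Biased exponent of the product before any normalization adjustment
    exp = exp_a + exp_b - bias
    return {
        "ZRO": is_zero,
        "NAN": is_nan,
        "INF": is_inf,
        "EOVFL": (not special) and exp >= max_exp,
        "EUNFL": (not special) and exp <= 0,
    }

print(special_flags(254, 0, 254, 0))
```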
One or more (or all) of the multiplier circuits of the multiplier circuit array may also include rounding circuitry to round the resultant product of the multiply operation to generate or provide a predetermined bit length, size or precision of the fraction field of the output data. For example, where the output data includes a floating point data format having a bit length, size or precision of 32 bits, the multiply operation of the two operands may generate more bits corresponding to the fraction field than are suitable or defined for 32 bit floating point data. Here, the rounding circuitry generates or provides rounding data which is employed to round the fraction field of the product to an appropriate bit length, size or precision corresponding to the data format (e.g., in the context of a 32 bit floating point data format, a 23 bit fraction field). Thus, in one embodiment, the rounding circuitry generates data/information to round the resultant product of the fraction fields of the operands.
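As a hedged illustration of reducing a wide fraction product to the bit length of the output format, the following Python sketch drops a given number of low-order bits using round-to-nearest, ties-to-even. The rounding mode is an assumption (the description above does not specify one), and the function name is hypothetical.

```python
def round_nearest_even(product, drop_bits):
    """Drop `drop_bits` (>= 1) low-order bits from a non-negative integer,
    rounding to nearest with ties going to the even result."""
    kept = product >> drop_bits
    remainder = product & ((1 << drop_bits) - 1)
    half = 1 << (drop_bits - 1)
    # Round up when above the midpoint, or exactly at it with an odd result
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1
    return kept

# 0b1011 with two bits dropped: 2.75 rounds to 3
print(round_nearest_even(0b1011, 2))
```

Note that rounding up can carry out of the most significant kept bit, in which case the result would need to be renormalized and the exponent adjusted; that step is omitted from this sketch.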
In addition, in one embodiment, at least one of the multiplier circuits of the multiplier circuit array, using data generated from the multiply operations in the plurality of multiplier circuits, may include circuitry to generate, form or construct the output data having (i) a sign bit of the resultant product, (ii) a value of an exponent field of the resultant product having a predetermined bit length, size or precision, and (iii) a value of a fraction field of the resultant product having a predetermined bit length, size or precision. That multiplier circuit may acquire or obtain the data from the other multiplier circuit(s) of the array via interconnect conductors (e.g., one or more buses (e.g., point-to-point and/or multi-drop)). Thereafter, the output data generated by the multiplier circuit array, which is the “final” product value resulting from the multiply operation of the two operands having the predefined or predetermined bit length, size or precision of the data format (e.g., a 32 bit floating point data format having a sign bit, an eight bit exponent field and a 23 bit fraction field), may be output on a bus and available to other circuitry, for example, for additional processing. In one embodiment, the output data is provided to an accumulator circuit of a multiplier-accumulator circuit of, for example, a data processing pipeline. Notably, multiplier-accumulator circuits may be referred to herein, at times, as “MACs” or “MAC circuits”, and singularly/individually as “MAC” or “MAC circuit”.
In another exemplary embodiment, the multiplier circuit array may include three (or more) multiplier circuits including, for example, a first multiplier circuit of an x-bit floating point type (e.g., a 32 bit floating point type multiplier circuit) which processes or performs a first portion of the multiply operation (e.g., values of the sign bit fields and the exponent fields of, for example, the input data and filter weights), a second multiplier circuit of a y-bit integer type (e.g., a 16 bit integer type multiplier circuit) which processes or performs one or more portions of the multiply operation (e.g., multiply operations of each first portion of the fraction fields of, for example, the input data and the fraction field portions of the filter weights) and a third multiplier circuit of a z-bit integer type (e.g., an 8 bit integer type multiplier circuit) which processes or performs one or more other portions of the multiply operation (e.g., multiply operations of each second portion of the fraction fields of, for example, the input data and the fraction field portions of the filter weights). In this embodiment, the second and third multiplier circuits may both perform multiply operations of different portions of fraction fields of the operands (e.g., the fraction field portions of the input data and the filter weights). For example, the second multiplier circuit may multiply the most significant bits (MSBs) of the fraction fields of the operands and the third multiplier circuit may multiply the remaining bits (in this example, least significant bits (LSBs)) of the fraction fields of the operands (e.g., the second multiplier circuit may perform or implement the multiply operation with respect to the 16 MSBs and the third multiplier circuit may perform or implement the multiply operation with respect to the 8 LSBs).
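The arithmetic behind splitting a fraction multiply into an MSB portion and an LSB portion can be illustrated with the following Python sketch. It is a sketch under stated assumptions: each operand is taken to be a 24 bit unsigned value split into 16 MSBs and 8 LSBs, and the cross terms between the MSB and LSB portions are simply computed inline, because the description above does not state which circuit, if any, handles them. The function name is hypothetical.

```python
def split_multiply(a, b, low_bits=8):
    """Multiply two unsigned integers by combining partial products of
    their most significant and least significant portions."""
    mask = (1 << low_bits) - 1
    a_hi, a_lo = a >> low_bits, a & mask
    b_hi, b_lo = b >> low_bits, b & mask

    hi = a_hi * b_hi                    # MSB portion (e.g., a 16x16 multiply)
    lo = a_lo * b_lo                    # LSB portion (e.g., an 8x8 multiply)
    cross = a_hi * b_lo + a_lo * b_hi   # cross terms (handling is assumed)

    # Weight each partial product by its bit position and sum
    return (hi << (2 * low_bits)) + (cross << low_bits) + lo

# Matches the direct product of the two 24-bit values
print(split_multiply(0x123456, 0xABCDEF) == 0x123456 * 0xABCDEF)
```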
In another embodiment, the multiplier circuit array may include three (or more) multiplier circuits, wherein only two of the multiplier circuits are employed in the multiplication operation. For example, the multiplier circuit array may include a first multiplier circuit of an x-bit floating point type (e.g., a 32 bit floating point type multiplier circuit) which processes or performs a first portion of the multiply operation (e.g., values of the sign bit fields and the exponent fields of, for example, the input data and filter weights), a second multiplier circuit of a y-bit integer type (e.g., an 8 bit integer type multiplier circuit) which may be employed to process or perform a second portion of the multiply operation (e.g., multiply operations of the fraction fields of, for example, the input data and the fraction fields of the filter weights) and a third multiplier circuit of a z-bit floating point type (e.g., a 16 bit floating point type multiplier circuit) which processes or performs the first portion or the second portion of the multiply operation (e.g., multiply operations of (i) sign bit fields and the exponent fields of, for example, the input data and filter weights, or (ii) fraction fields of, for example, the input data and the fraction field portions of the filter weights). In this embodiment, the first and second multiplier circuits may be employed to perform the multiply operation, the first and the third multiplier circuits may be employed to perform multiply operations, or the second and third multiplier circuits may be employed to perform multiply operations. Thereafter, the “product” or output of each multiplier circuit may be “combined” or “joined”, wherein the data from the multiplier circuits are “reconstructed” into product data having a predetermined data format.
The predetermined format of the product data (e.g., (i) floating point or integer type and/or (ii) bit lengths of the fields) may or may not be the same as one or both of the operands—and which format may be programmable (e.g., one-time or more than one-time).
Each multiplier circuit of the multiplier circuit array may include enable/disable circuitry and/or select/deselect circuitry to facilitate the operable configuration of the multiplier circuit array to implement a predetermined multiply operation (e.g., operations performed having a predetermined data format and using a predetermined precision to, for example, provide output data (resultant product) having a predetermined format and predetermined precision). For example, where one or more of the multiplier circuit(s) of the multiplier circuit array is/are employed or incorporated in the multiply operations, such multiplier circuit(s) is/are enabled and configured to process or perform a portion of the multiply operation (e.g., values of the sign bit fields and the exponent fields of, for example, the input data and filter weights) and one or more other multiplier circuit(s) of the multiplier circuit array perform one or more other portions of the multiply operation (e.g., values of fraction fields of, for example, the input data and filter weights). In the event that one or more of the multiplier circuit(s) of the array is/are not employed or utilized in performance of the multiply operations, such one or more of the multiplier circuit(s) is/are deselected and, in one embodiment, disabled (e.g., de-coupled from the input and output bus, de-coupled from the interconnection bus and/or electrically powered-down).
The configuration of the multiplier circuit array may be user or system defined and/or may be one-time programmable/configurable (e.g., at manufacture) or more than one-time programmable/configurable (e.g., (i) at or via power-up, start-up or performance/completion of the initialization sequence/process sequence, and/or (ii) in situ (i.e., during operation of the integrated circuit), and/or at or during initialization, re-initialization, configuration, re-configuration or the like). In one embodiment, control circuitry is employed to program/configure the multiplier circuit array including the plurality of multiplier circuits. The control circuitry, in one embodiment, programs/configures the multiplier circuit array one-time; in another embodiment, the control circuitry programs/configures the multiplier circuit array more than one-time (i.e., multiple times). For example, the control circuitry may receive select and/or enable signals from internal or external circuitry (i.e., external to the one or more integrated circuits; for example, a host computer/processor) including one or more data storage circuits (e.g., one or more memory cells, register, flip-flop, latch, block/array of memory), one or more input pins/conductors, a look-up table (LUT) of any kind, a processor or controller and/or discrete control logic. The control circuitry, in response thereto, may employ such signal(s) to enable or disable selected multiplier circuits of the multiplier circuit array and thereby configure the multiplier circuitry of, for example, the MAC or MACs of a data processing pipeline, to implement the multiply operations. The control circuitry may configure the multiplier circuitry in situ and/or at or during power-up, start-up, initialization, re-initialization, configuration, re-configuration or the like.
Indeed, in one embodiment, control circuitry may evaluate the input data and, based thereon, implement or select a configuration of the multiplier circuit array to provide the appropriate configuration to implement or provide a predetermined precision and data format of the resultant multiplication product (output data).
For example, the multiplier circuit array may include a first multiplier circuit of an x-bit floating point type (e.g., a 32 bit floating point type multiplier circuit), which processes a first portion of the multiply operation (e.g., values of the sign bit fields and the exponent fields of, for example, the input data and filter weights), a second multiplier circuit of a y-bit integer type (e.g., a 16 bit integer type multiplier circuit), which processes one or more portions of the multiply operation (e.g., values of fraction fields of, for example, the input data and filter weights) and a third multiplier circuit of a z-bit integer type (e.g., an 8 bit integer type multiplier circuit) which processes one or more portions of the multiply operation (e.g., values of fraction fields of, for example, the input data and filter weights). Where the precision and data format of the input data and filter weights are 16 bit floating point data format, control circuitry may enable and select the first multiplier circuit and one of the second or third multiplier circuits of the multiplier circuit array to implement the multiply operations of the multiplier circuitry. Here, the control circuitry may configure the multiplier circuit array so that the first multiplier circuit performs or implements the multiply operation in connection with the values of the sign bit fields and the exponent fields of, for example, the input data and filter weights, and the second multiplier circuit or the third multiplier circuit of the multiplier circuit array performs or implements the multiply operation in connection with the values of fraction fields of the input data and filter weights (in this example, an 8×8 multiply operation).
In another embodiment, where the precision and data format of the input data and filter weights have a 24 bit floating point data format, control circuitry may enable and select the first and second multiplier circuits of the multiplier circuit array to implement the multiply operations of the multiplier circuitry. Here, the first multiplier circuit may perform or implement the multiply operation in connection with the values of the sign bit fields and the exponent fields of, for example, the input data and filter weights, and the second multiplier circuit of the multiplier circuit array may perform or implement the multiply operation in connection with the values of fraction fields of the input data and filter weights (in this example, a 16×16 multiply operation). Notably, the third multiplier circuit (in this exemplary embodiment, an 8 bit integer type multiplier circuit), does not have the capacity to efficiently multiply the 15 bit values of each fraction field of the input data and filter weights. Thus, the control circuitry enables and selects the first and second multiplier circuits of the multiplier circuit array which communicate via the interconnect conductors disposed therebetween.
In another embodiment, where the precision and data format of the input data and filter weights have a 16 bit floating point data format, control circuitry may enable and select the first and third multiplier circuits of the multiplier circuit array to implement the multiply operations of the multiplier circuitry. Here, the first multiplier circuit may perform or implement the multiply operation in connection with the values of the sign bit fields and the exponent fields of, for example, the input data and filter weights, and the third multiplier circuit of the multiplier circuit array may perform or implement the multiply operation in connection with the values of fraction fields of the input data and filter weights (in this example, a 8×8 multiply operation). Notably, the third multiplier circuit (in this exemplary embodiment, an 8 bit integer type multiplier circuit), includes the capacity to efficiently multiply the 7 bit values of each fraction field of the input data and filter weights. Alternatively, the second multiplier circuit of the multiplier circuit array may perform or implement the multiply operation in connection with the values of fraction fields of the input data and filter weights (in this example, a 16×16 multiply operation)—however, it may be more efficient (power and timing) to employ the 8 bit integer type multiplier circuit given the difference in bit size of the multiply core (8 bit vs. 16 bit) circuit. Thus, the control circuitry enables and selects the first and third multiplier circuits of the multiplier circuit array which communicate via the interconnect conductors disposed therebetween.
In yet another embodiment, where the precision and data format of the input data and filter weights are 32 bit floating point, control circuitry may enable and select the first, second and third multiplier circuits of the multiplier circuit array to implement the multiply operations of the multiplier circuitry. Here, the first multiplier circuit may perform or implement the multiply operation in connection with the values of the sign bit fields and the exponent fields of, for example, the input data and filter weights, and the second multiplier circuit of the multiplier circuit array may perform or implement the multiply operation in connection with the values of a first portion of the fraction fields of the input data and filter weights, and the third multiplier circuit of the multiplier circuit array may perform or implement the multiply operation in connection with the values of a second portion of the fraction fields of the input data and filter weights.
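The selection performed by the control circuitry in the three preceding examples can be summarized, purely for illustration, as a lookup from operand format to enabled multiplier circuits. The following Python sketch is not a description of the control circuitry itself. The class and function names are hypothetical, and the mapping assumes the exemplary array discussed above (a floating point type first multiplier circuit, a larger integer type second multiplier circuit and an 8 bit integer type third multiplier circuit).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArrayConfig:
    """Hypothetical enable bits, one per multiplier circuit of the array."""
    first_enabled: bool    # sign bit and exponent fields
    second_enabled: bool   # fraction fields, larger integer core
    third_enabled: bool    # fraction fields, 8 bit integer core

def select_config(format_bits):
    """Map an operand floating point width to an exemplary configuration."""
    if format_bits == 16:   # 7 bit fractions: the 8 bit core suffices
        return ArrayConfig(True, False, True)
    if format_bits == 24:   # 15 bit fractions: the larger core is needed
        return ArrayConfig(True, True, False)
    if format_bits == 32:   # fraction split across both integer cores
        return ArrayConfig(True, True, True)
    raise ValueError(f"unsupported format width: {format_bits}")

print(select_config(16))
```

A circuit whose enable bit is false corresponds to a deselected multiplier circuit, which in one embodiment may also be disabled or powered down.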
As discussed above, the multiplier circuits of the multiplier circuit array may be programmed/configured via control circuitry, for example, in situ (i.e., during operation of the integrated circuit), at manufacture, and/or at or during power-up, start-up, initialization, re-initialization, configuration, re-configuration or the like.
Notably, the multiplier circuit array of the present inventions may be incorporated and/or implemented in one or more (or all) multiplier-accumulator circuits of an execution or processing pipeline including execution circuitry employing one or more floating point data formats. Here, in another aspect of the present inventions, the multiplier-accumulator circuit(s) may include a multiplier circuit array (which, in one embodiment, is configurable to provide a predetermined precision of the resultant multiplication product (output data)). The multiplier circuit array may include a floating point type multiplier and an integer type multiplier. The output of the multiplier circuit array, having a floating point data format, may be provided to the accumulator circuit, which is a floating point type accumulator. In one embodiment, the execution or processing pipeline includes a plurality of multiplier-accumulator circuits, each circuit including a multiplier circuit array (e.g., having an identical configuration). For example, the plurality of multiplier-accumulator circuits (each having a multiplier circuit array) may be interconnected (in series) to perform the multiply and accumulate operations and/or the pipelining architecture or configuration implemented via connection of multiplier-accumulator circuits. In this pipeline architecture, for example, the plurality of multiplier-accumulator circuits may concatenate the multiply and accumulate operations of the data processing.
The multiplier circuit array of the present inventions may be employed and/or implemented in the circuitry described and/or illustrated in U.S. patent application Ser. Nos. 16/545,345, 17/019,212 and/or 17/391,082. Here, the multiplier circuit array of the present inventions may be incorporated into the multiplier circuitry of the multiplier-accumulator circuit described and/or illustrated in the '345, '212 and/or '082 applications to, for example, facilitate concatenating the multiply and accumulate operations, and reconfiguring the circuitry thereof and operations performed thereby (see, e.g., the exemplary embodiments illustrated in FIGS. 1A-1C of U.S. patent application Ser. No. 16/545,345). In this way, each multiplier-accumulator circuit includes a multiplier circuit array to, for example, process data (e.g., image data) in a manner whereby the processing and operations are performed as described herein. The '345, '212 and '082 applications are incorporated by reference herein in their entirety.
The multiplier circuit array of the present inventions may also be employed and/or implemented in the multiplier-accumulator circuits of the processing pipelines or architectures, and circuitry to configure and control such pipelines/architectures, described and/or illustrated in U.S. patent application Ser. Nos. 17/019,212 and 17/391,082. In this regard, the multiplier circuitry of the multiplier-accumulator circuits may include the multiplier circuit array described and illustrated herein; as noted above, the '212 and '082 applications are incorporated by reference in their entirety.
Further, the multiplier-accumulator circuits (having the multiplier circuit array of the present inventions described and/or illustrated herein) may be interconnected into execution or processing pipelines as described and/or illustrated in U.S. patent application Ser. No. 17/212,411; the '411 application is incorporated by reference herein in its entirety. In one embodiment, the circuitry configures and controls a plurality of separate multiplier-accumulator circuits (each having a multiplier circuit array of the present inventions) or rows/banks of such multiplier-accumulator circuits (which are interconnected, for example, in series; such rows/banks thereof are referred to, at times, as clusters) to pipeline multiply and accumulate operations. In one embodiment, the plurality of multiplier-accumulator circuits (having the multiplier circuit array) may include a plurality of registers (including a plurality of shadow registers) wherein the circuitry also controls such registers to implement or facilitate the pipelining of the multiply and accumulate operations performed by the multiplier-accumulator circuits to increase throughput of the multiplier-accumulator execution or processing pipelines in connection with processing the related data (e.g., image data). (See, e.g., the '345 application).
In another embodiment, the interconnection of the pipeline or pipelines (each including a plurality of multiplier-accumulator circuits having the multiplier circuit array of the present inventions described and/or illustrated herein) may be configurable or programmable to provide different forms of pipelining, as described and/or illustrated in U.S. patent application Ser. No. 17/212,411. Here, the pipelining architecture provided by the interconnection of the plurality of multiplier-accumulator circuits (having the multiplier circuit array of the present inventions described and/or illustrated herein) may be controllable or programmable. In this way, a plurality of multiplier-accumulator circuits, each circuit having a multiplier circuit array of the present inventions described and/or illustrated herein, may be configured and/or re-configured to form or provide the desired processing pipeline(s) to process data (e.g., image data). For example, with reference to the '411 application, in one embodiment, control/configure circuitry may configure or determine which multiplier-accumulator circuits having a multiplier circuit array described herein, or which rows/banks of interconnected multiplier-accumulator circuits having a multiplier circuit array described herein, are interconnected (in series) to perform the multiply and accumulate operations and/or the pipelining architecture or configuration implemented via connection of multiplier-accumulator circuits (or rows/banks of interconnected multiplier-accumulator circuits). Thus, in one embodiment, the control/configure circuitry configures or implements an architecture of the execution or processing pipeline by controlling or providing connection(s) between such multiplier-accumulator circuits and/or such rows of interconnected multiplier-accumulator circuits, each of which includes one or more multiplier circuit array embodiments described herein.
Notably, the circuitry of the present inventions may be disposed on or in integrated circuit(s), for example, (i) a processor, controller, state machine, gate array, system-on-chip (“SOC”), programmable gate array (“PGA”) and/or field programmable gate array (“FPGA”), and/or (ii) a processor, controller, state machine and SOC including an embedded FPGA, and/or (iii) an integrated circuit (e.g., processor, controller, state machine and SoC)—including an embedded processor, controller, state machine, and/or PGA. Indeed, the circuitry of the present inventions may be disposed on or in integrated circuit(s) dedicated exclusively to such circuitry.
The present inventions may be implemented in connection with embodiments illustrated in the drawings hereof. These drawings show different aspects of the present inventions and, where appropriate, reference numerals, nomenclature, or names illustrating like circuits, architectures, structures, components, materials and/or elements in different figures are labeled similarly. It is understood that various combinations of the structures, components, materials and/or elements, other than those specifically shown, are contemplated and are within the scope of the present inventions.
Moreover, there are many inventions described and illustrated herein. The present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Moreover, each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof. For the sake of brevity, certain permutations and combinations are not discussed and/or illustrated separately herein. Notably, an embodiment or implementation described herein as “exemplary” is not to be construed as preferred or advantageous, for example, over other embodiments or implementations; rather, it is intended to reflect or indicate that the embodiment(s) is/are “example” embodiment(s).
Notably, the configurations, block/data width, data path width, bandwidths, data lengths, values, processes, pseudo-code, operations, and/or algorithms described herein and/or illustrated in the FIGURES, and text associated therewith, are exemplary. Indeed, the inventions are not limited to any particular or exemplary circuit, logical, block, functional and/or physical diagrams, number of multiplier-accumulator circuits employed in an execution pipeline, number of execution pipelines employed in a particular processing configuration, organization/allocation of memory, block/data width, data path width, bandwidths, values, processes, pseudo-code, operations, and/or algorithms illustrated and/or described in accordance with, for example, the exemplary circuit, logical, block, functional and/or physical diagrams.
Moreover, although the illustrative/exemplary embodiments include a plurality of memories (e.g., L3 memory, L2 memory, L1 memory, L0 memory) which are assigned, allocated and/or used to store certain data and/or in certain organizations, one or more memories may be added, and/or one or more memories may be omitted and/or combined/consolidated (for example, the L3 memory or L2 memory), and/or the organizations may be changed, supplemented and/or modified. The inventions are not limited to the illustrative/exemplary embodiments of the memory organization and/or allocation set forth in the application. Again, the inventions are not limited to the illustrative/exemplary embodiments set forth herein.
Again, there are many inventions described and illustrated herein. The present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof. For the sake of brevity, many of those combinations and permutations are not discussed or illustrated separately herein.
In one aspect, the present inventions are directed to one or more integrated circuits having processing circuitry, for example, multiplier-accumulator circuits (and methods of operating such circuits), to process data (e.g., filtering image data) wherein the processing circuitry includes multiplier circuitry including a multiplier circuit array. The multiplier circuit array includes a plurality of interconnected multiplier circuits to implement or perform multiply operations in connection with data (e.g., multiply input data and filter weights). In one embodiment, the data have a floating point data format (e.g., having 16, 24 or 32 bits). In addition thereto, or in lieu thereof, in another embodiment, the data may have a fixed point data format (e.g., integer data format having, for example, 16, 24 or 32 bits). The multiplier circuit array may include two multiplier circuits wherein each multiplier circuit includes the same or a different circuit type (e.g., floating point type and/or integer type) and/or the same or a different multiplication precision (e.g., multiplier circuit A may be, for example, a 24 bit or 32 bit multiplier circuit; and multiplier circuit B may be, for example, a 16, 24 or 32 bit multiplier circuit). Indeed, where the processing circuitry is a plurality of multiplier-accumulator circuits, in one embodiment, each multiplier-accumulator circuit includes a dedicated, separate or distinct multiplier circuit array, wherein each multiplier circuit array includes a plurality of interconnected multiplier circuits, to implement or perform multiply operations of the associated multiplier-accumulator circuit.
With reference to
The plurality of interconnected multiplier circuits of the multiplier circuit array, in one embodiment, may perform the multiply operations based on different data formats (e.g., a first multiplier circuit may be a floating point type and a second multiplier circuit may be an integer type). Moreover, the plurality of interconnected multiplier circuits, in one embodiment, may include the same or different multiplication precisions (e.g., a first multiplier circuit may be an x-bit floating point type (e.g., 32 bit floating point type multiplier circuit) and a second multiplier circuit may be a y-bit integer type (wherein y may or may not equal x; e.g., a 32 bit integer type multiplier circuit, a 24 bit integer type multiplier circuit, or a 16 bit integer type multiplier circuit)).
The multiplier circuit array may receive the input data (a first operand, e.g., image data, and a second operand, e.g., filter weights) via a multi-drop bus. (See,
With reference to
In one embodiment, two or more of the interconnected multiplier circuits of the multiplier circuit array may perform the multiply operation of the multiplier-accumulator circuit. For example, with reference to
The plurality of interconnected multiplier circuits of the multiplier circuit array, in one embodiment, may perform the multiply operations based on same or different data formats wherein the circuitry of the multiplier circuits are the same or different data types. For example, in one embodiment, the multiplier circuit array may multiply two operands each having a floating point data format. In another embodiment, the multiplier circuit array may include two multiplier circuits wherein each multiplier circuit includes different circuit types (e.g., floating point type or integer type). For example, with continued reference to
With continued reference to
Notably, each multiplier circuit of the multiplier circuit array may be a complete and fully functional/capable multiplier circuit or may be a partial multiplier circuit including only certain or selected circuitry of a complete and fully functional/capable multiplier circuit (e.g., omission of: (i) circuitry to perform the multiply operation corresponding to sign fields and exponent fields of the operands or (ii) circuitry to perform the multiply operation corresponding to the fraction fields of the operands). For example, in one embodiment, a first multiplier circuit may be an x-bit floating point type (e.g., a 32 bit or 24 bit floating point type multiplier circuit) which processes or performs a first portion of the multiply operation (e.g., values of the sign bit fields and the exponent fields of, for example, the input data and filter weights) and a second multiplier circuit may be a y-bit integer type (e.g., a 32 bit, 24 bit or 16 bit integer type multiplier circuit) which processes or performs one or more other portions of the multiply operation (e.g., values of fraction fields of, for example, the input data and filter weights). In this exemplary embodiment, the first multiplier circuit may not include circuitry to perform the one or more portions of the multiply operation that is to be performed by the second multiplier circuit (e.g., circuitry associated with the multiply operation of the values of fraction fields of, for example, the input data and filter weights). Similarly, the second multiplier circuit may not include circuitry to perform the one or more portions of the multiply operation that is to be performed by the first multiplier circuit (e.g., circuitry associated with the multiply operation corresponding to the values of the sign bit fields and the exponent fields of, for example, the input data and filter weights).
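By way of illustration only, the division of the multiply operation between a first multiplier circuit and a second multiplier circuit may be sketched in software. The following Python sketch assumes a 32 bit floating point data format having a sign bit, an eight bit exponent field and a 23 bit fraction field; the function names `fp32_fields` and `route_operands` and the dictionary keys are hypothetical labels introduced here and do not appear in the embodiments.

```python
import struct

def fp32_fields(value):
    """Split a value (rounded to FP32) into (sign, exponent, fraction) fields."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                # 1 bit sign field
    exponent = (bits >> 23) & 0xFF   # 8 bit exponent field
    fraction = bits & 0x7FFFFF       # 23 bit fraction field
    return sign, exponent, fraction

def route_operands(input_data, filter_weight):
    """Route related fields of both operands to the circuit that handles them."""
    d_sign, d_exp, d_frac = fp32_fields(input_data)
    f_sign, f_exp, f_frac = fp32_fields(filter_weight)
    return {
        # first (floating point type) circuit: sign and exponent fields only
        "multiplier_A": {"signs": (d_sign, f_sign), "exponents": (d_exp, f_exp)},
        # second (integer type) circuit: fraction fields only
        "multiplier_B": {"fractions": (d_frac, f_frac)},
    }

routed = route_operands(1.5, -2.0)
print(routed)
```

Under these assumptions, neither routed group contains the fields handled by the other circuit, which mirrors the case in which each partial multiplier circuit omits the circuitry of the other.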
The multiplier circuits of the multiplier circuit array may be interconnected via conductors (e.g., one or more buses (e.g., point-to-point and/or multi-drop)). With reference to
With reference to
Thus, in one embodiment, the operands are deconstructed into predetermined fields (e.g., sign, exponent and fractions), wherein the related fields of the operands are input to the multiplier circuits and multiplied thereby (e.g., each multiplier circuit performing a portion of the multiply operation). The one multiplier circuit (here, multiplier circuit A) may acquire or obtain the data from the other multiplier circuit(s) of the array (here, multiplier circuit B), via the interconnect conductors/bus (IB) (e.g., one or more buses (e.g., point-to-point and/or multi-drop)). Thereafter, the “product” or output of each multiplier circuit may be “combined” or “joined”—wherein the data from the multiplier circuits are “reconstructed” into product data having a predetermined data format (e.g., programmable format—one-time or more than one-time). That is, the output data generated by the multiplier circuit array, which is the “final” product value resulting from the multiply operation of the two operands having the predefined or predetermined bit length, size or precision of the data format (i.e., a 32 bit floating point data format having a sign bit, an eight bit exponent field and a 23 bit fraction field), may be output by multiplier circuit A on an output bus and thereafter available to other circuitry, for example, for additional processing. The predetermined format of the product data (e.g., (i) floating point or integer type and/or (ii) bit lengths of the fields) may or may not be the same as one or both of the operands. Indeed, in one embodiment, the data output by the multiplier circuit array is provided to the accumulator circuit of the associated MAC of, for example, a data processing pipeline. (See, e.g.,
As noted above, each multiplier circuit of the multiplier circuit array may be a complete and fully functional/capable multiplier circuit or may be a partial multiplier circuit including only certain or selected circuitry of a complete and fully functional/capable multiplier circuit (e.g., omission of: (i) circuitry to perform the multiply operation corresponding to sign fields and exponent fields of the operands or (ii) circuitry to perform the multiply operation corresponding to the fraction fields of the operands). In one embodiment, the multiplier circuit array may include a plurality of multiplier circuits wherein one or more of the multiplier circuit(s) is/are incorporated or embedded into another multiplier circuit of the multiplier circuit array. For example, with reference to
Notably, the discussions relative to
With reference to
With reference to
With continued reference to
Notably, as mentioned above, the MAC (including a multiplier circuit array) of the present inventions may be employed and/or implemented in the circuitry described and/or illustrated in U.S. patent application Ser. Nos. 16/545,345, 17/019,212 and/or 17/391,082. Here, the multiplier circuit array of the present inventions may be incorporated into the multiplier circuitry of the MAC described and/or illustrated in the '345, '212 and '082 applications to, for example, facilitate concatenating the multiply and accumulate operations, and reconfiguring the circuitry thereof and operations performed thereby (see, e.g., the exemplary embodiments illustrated in FIGS. 1A-1C of U.S. patent application No. 16/545,345). In this way, each MAC includes a multiplier circuit array to, for example, process data (e.g., image data) in a manner whereby the processing and operations are performed as described herein. As noted above, the '345, '212 and '082 applications are incorporated by reference herein in their entirety.
In addition, the MACs (each including a multiplier circuit array) may be employed in the processing pipelines or architectures, and with the circuitry to configure and control such pipelines/architectures, described and/or illustrated in U.S. patent application Ser. No. 17/019,212. In this regard, the multiplier circuitry of the MACs may include the multiplier circuit array described and illustrated herein; again, as noted above, the '212 application is incorporated by reference herein in its entirety.
Further, the multiplier-accumulator circuits (having the multiplier circuit array of the present inventions described and/or illustrated herein) may be interconnected into execution or processing pipelines as described and/or illustrated in U.S. patent application Ser. Nos. 17/212,411 and 17/391,082; the '411 and '082 applications are incorporated by reference herein in their entirety. In one embodiment, the circuitry configures and controls a plurality of separate multiplier-accumulator circuits (each having a multiplier circuit array of the present inventions) or rows/banks of such multiplier-accumulator circuits (which are interconnected, for example, in series (such rows/banks thereof are referred to, at times, as clusters)) to pipeline multiply and accumulate operations. In one embodiment, the plurality of multiplier-accumulator circuits (having the multiplier circuit array) may include a plurality of registers (including a plurality of shadow registers) wherein the circuitry also controls such registers to implement or facilitate the pipelining of the multiply and accumulate operations performed by the multiplier-accumulator circuits to increase throughput of the multiplier-accumulator execution or processing pipelines in connection with processing the related data (e.g., image data). (See, e.g., '345 application).
In another embodiment, the interconnection of the pipeline or pipelines (each including a plurality of multiplier-accumulator circuits (having the multiplier circuit array of the present inventions described and/or illustrated herein)) may be configurable or programmable to provide different forms of pipelining, as described and/or illustrated in the '411 and '082 applications. Here, the pipelining architecture provided by the interconnection of the plurality of multiplier-accumulator circuits (having the multiplier circuit array of the present inventions described and/or illustrated herein) may be controllable or programmable. In this way, a plurality of multiplier-accumulator circuits, each circuit having a multiplier circuit array of the present inventions described and/or illustrated herein, may be configured and/or re-configured to form or provide the desired processing pipeline(s) to process data (e.g., image data). For example, with reference to U.S. patent application Ser. Nos. 17/212,411 and 17/391,082, in one embodiment, control/configure circuitry may configure or determine which multiplier-accumulator circuits having a multiplier circuit array described herein, or which rows/banks of interconnected multiplier-accumulator circuits having a multiplier circuit array described herein, are interconnected (in series) to perform the multiply and accumulate operations and/or the pipelining architecture or configuration implemented via connection of multiplier-accumulator circuits (or rows/banks of interconnected multiplier-accumulator circuits). Thus, in one embodiment, the control/configure circuitry configures or implements an architecture of the execution or processing pipeline by controlling or providing connection(s) between such multiplier-accumulator circuits and/or such rows of interconnected multiplier-accumulator circuits, each of which includes one or more multiplier circuit array embodiments described herein.
With reference to
With reference to
With continued reference to
The filter weights, in one exemplary embodiment, are accessed in or read from L0 memory (such as SRAM). In one embodiment, the filter weights may be previously loaded from L2 memory to L1 memory, and then from L1 memory to L0 memory. (See
Alternatively, in one embodiment, the filter weights are stored in memory (e.g., L2 memory) in an FP16 format (16 bits for sign, exponent, fraction). The filter weight values, in this embodiment, are read from memory (L2—SRAM memory) and directly stored in the L1 and L0 memory levels. Thereafter, the filter weights are loaded into the filter weight register “F” and are available/accessible to the multiplier circuitry to implement the multiplication operation of the execution circuitry/process of the data processing circuitry. In yet another embodiment, the filter weight values are read from memory (e.g., L2 or L1—SRAM memory) and directly loaded into the filter weight register “F” for use by the multiplier circuit array of the execution circuitry/process of the MAC processors.
Notably, other numerical precisions and/or data formats may be employed for the various values which are to be processed—the values that are described in this exemplary embodiment represent the precision (e.g., minimum precision) that is practical for a floating point format.
With continued reference to
In one embodiment, a plurality of outputs of the accumulator circuit may be accumulated. That is, after each result “Y” has accumulated a plurality of products, the accumulation totals may be parallel-loaded into the “MAC-SO” registers. Thereafter, the accumulation data may be serially shifted out (i.e., output) during a subsequent or the next execution sequence (e.g., to memory).
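The parallel load and subsequent serial output of the accumulation totals may be modeled, for illustration, as follows. This Python sketch is a behavioral model only; the function names, the direction of the shift, and the zero value shifted in behind the data are assumptions made for the example and are not taken from the embodiments.

```python
def parallel_load(accumulation_totals):
    """Copy every accumulation total "Y" into its MAC-SO register at once."""
    return list(accumulation_totals)

def serial_shift_out(mac_so):
    """Shift the chain one position per step, emitting the end register."""
    chain = list(mac_so)
    shifted_out = []
    for _ in range(len(chain)):
        shifted_out.append(chain[-1])   # value leaving the end of the chain
        chain = [0] + chain[:-1]        # remaining values move one position
    return shifted_out

totals = [10, 20, 30]                   # one total per MAC (illustrative)
print(serial_shift_out(parallel_load(totals)))
```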
Notably, with continued reference to
With reference to
In this exemplary embodiment, during processing, the Yijlk MAC values are rotated through all 64 MAC processing circuits during the 64 execution cycles after being loaded from the Yijk shifting chain (see YMEM memory), and will be unloaded with the same shifting chain. Further, “m” (e.g., 64 in the illustrative embodiment) MAC processing circuits in the execution pipeline operate concurrently whereby the multiplier-accumulator processing circuits perform m×m (e.g., 64×64) multiply-accumulate operations in each m (e.g., 64) cycle interval (here, a cycle may be nominally 1 ns). Thereafter, a next set of input pixels/data (e.g., 64) is shifted-in and the previous output pixels/data is shifted-out during the same m (e.g., 64) cycle interval. Notably, each m (e.g., 64) cycle interval processes a Dd/Yd (depth) column of input and output pixels/data at a particular (i,j) location (the indexes for the width Dw/Yw and height Dh/Yh dimensions). The m (e.g., 64) cycle execution interval is repeated for each of the Dw*Dh depth columns for this stage. In this exemplary embodiment, the filter weights or weight data are loaded into memory (e.g., the L1/L0 SRAM memories) from, for example, an external memory or processor before the stage processing starts (see, e.g., the '345, '212 and '082 applications). In this particular embodiment, the input stage has Dw=512, Dh=256, and Dd=128, and the output stage has Yw=512, Yh=256, and Yd=64. Note that only 64 of the 128 Dd inputs are processed in each 64×64 MAC execution operation.
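The arithmetic implied by the foregoing dimensions may be worked through as a short calculation. The Python sketch below uses only the values stated above (m=64, Dw=512, Dh=256, Dd=128, nominal 1 ns cycle); the function name `stage_cost` and the parameter `passes_per_column` are hypothetical, and treating the 128 deep input as two 64 deep passes per column is an assumption of the example rather than a statement of the embodiment.

```python
DW, DH, DD = 512, 256, 128   # input width, height and depth from the text
M = 64                       # MAC processing circuits in the pipeline
CYCLE_NS = 1.0               # nominal cycle time from the text

def stage_cost(passes_per_column=1):
    """Count intervals, cycles and multiply-accumulates for one stage."""
    columns = DW * DH                         # one depth column per (i, j)
    macs_per_interval = M * M                 # m x m operations per interval
    cycles = columns * M * passes_per_column  # m cycles per interval
    return {
        "columns": columns,
        "macs_per_interval": macs_per_interval,
        "cycles": cycles,
        "time_ms": cycles * CYCLE_NS / 1e6,
    }

single_pass = stage_cost()
# Assumption: covering all 128 input depths takes DD // M = 2 passes.
full_depth = stage_cost(passes_per_column=DD // M)
print(single_pass)
print(full_depth)
```

With these values, one pass over the stage is 131,072 intervals of 64 cycles each, or roughly 8.4 ms at the nominal cycle time.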
With continued reference to
Indeed, the method illustrated in
With reference to
With continued reference to
In one embodiment, the multiplier circuit array of each MAC receives input data (Dijk) and an associated filter weight Fkl (e.g., from memory—see, e.g.,
In this embodiment, the linearly connected MAC pipeline is configured such that Yijl data is fixed in a MAC processor during execution whereas the input data (Dijk data) rotates during execution through or between the MAC processors. That is, the Yijl accumulation values are not output (moved or rotated), during or after each cycle of the execution sequence (i.e., set of associated execution cycles), to the immediately following MAC and employed in the accumulation operation. With that in mind, the accumulator circuit receives the previous accumulation value output therefrom (see MAC_r[p]). Thus, in each execution cycle, the Fkl value in the D_r[p] register is multiplied by the Dijk value in the D_i[p] register, via the multiplier circuit array, and the result/product is loaded in the MULT_r[p] register. In the next pipeline cycle this D*F value is added to the Yijl accumulation value in the local MAC_r[p] register (in the same or associated MAC processor) and the result is loaded in the MAC_r[p] register. This is repeated for the execution cycles of the current execution sequence. Here, the immediately previous accumulation value is provided to the accumulator circuit and employed in the accumulation operation.
With continued reference to
In this embodiment, the MACs are configured such that the output of the accumulator circuit (“ADD”) is input back into the accumulator circuit (“ADD”) of the associated MAC (see, MAC_r[p]) and employed in the accumulation operation. Moreover, the output of each accumulator circuit (“ADD”) of the MACs is not rotated, transferred or moved to the immediately following MAC of the linear processing pipeline (compare
The MAC processors also include a shifting chain (MAC_SO[p]) for preloading the Yijl sum. In this embodiment, each MAC also uses the shifting chain (MAC_SO[p]) for unloading or outputting the Yijl sums (final accumulation values). The previous Yijl sums are shifted out (i.e., rotated, transferred) while the next Yijl sums are shifted in. Notably, in this embodiment, the Yijl shifting chain (MAC_SO[p]) may be employed for both preloading and unloading. Thus, in this embodiment, the linearly connected pipeline architecture may be characterized by Yijl data that is fixed in place during execution and Dijk data that rotates during execution. That is, the input data values (Dijk data values) rotate through all of the MAC processors or MACs during the associated execution cycles of the execution sequence after being loaded from the Dijk shifting chain. As noted above, in this embodiment, the Yijlk accumulation values will be held or maintained in a MAC processor during the associated execution cycles of the execution sequence, after being loaded from the Yijk shifting chain, and the final Yijlk accumulation values will be unloaded via the same shifting chain.
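A pipeline in which the accumulation values are fixed in place while the input data rotate may be illustrated with a small behavioral model. The Python sketch below uses three MAC processors rather than 64; the function name `run_pipeline`, the direction of rotation, and the order in which filter weights are selected are assumptions of the example, and the one cycle delay through the MULT_r[p] register is omitted for brevity.

```python
def run_pipeline(d, f):
    """Accumulate Y[l] = sum over k of D[k] * F[k][l] with D rotating."""
    m = len(d)
    mac_r = [0] * m              # accumulation value held in each MAC
    d_regs = list(d)             # input data value currently in each MAC
    k_index = list(range(m))     # which input index each MAC currently holds
    for _ in range(m):           # one execution sequence of m cycles
        for p in range(m):
            # each MAC multiplies its current input by the matching weight
            # and adds the product to its own (non-rotating) accumulation
            mac_r[p] += d_regs[p] * f[k_index[p]][p]
        # input data move to the neighbouring MAC; accumulations do not move
        d_regs = d_regs[1:] + d_regs[:1]
        k_index = k_index[1:] + k_index[:1]
    return mac_r

d = [1, 2, 3]
f = [[1, 0, 2],
     [0, 1, 0],
     [4, 0, 1]]
print(run_pipeline(d, f))        # prints [13, 2, 5]
```

After m cycles every input value has visited every MAC once, so each MAC holds a complete sum without any accumulation value having been transferred.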
Notably, these techniques, which generalize the applicability of the MAC execution pipeline (in this exemplary embodiment, 64×64) of, for example,
With reference to
With reference to
For the purposes of illustration, a 32 bit floating point data format (FP32) is often employed to explain or describe certain circuitry, operation thereof, and/or methods of certain aspects of certain features of the present inventions including in the context of the multiply operation and type of multiplier circuit of the multiplier circuit array. Similarly, an integer data format (e.g., an 8 bit integer data format (INT8), a 16 bit integer data format (INT16), a 24 bit integer data format (INT24), and 32 bit integer data format (INT32)) is often employed to explain or describe certain circuitry, operation thereof, and/or methods of certain aspects of certain features of the present inventions including in the context of the multiply operation and type of multiplier circuit of the multiplier circuit array. The inventions, including the embodiments thereof described and/or illustrated herein, are not limited to particular floating point data format(s), particular fixed point data format(s), precisions thereof, block/data width, data path width, bandwidths, values, processes and/or algorithms illustrated.
With reference to
Notably, each multiplier circuit of the multiplier circuit array may be a partial multiplier circuit including only certain or selected circuitry of a complete and fully functional/capable multiplier circuit (e.g., omission of: (i) circuitry to perform the multiply operation corresponding to sign fields and exponent fields of the operands or (ii) circuitry to perform the multiply operation corresponding to the fraction fields of the operands). Here, the multiplier circuit A processes or performs a portion(s) of the multiply operation (i.e., multiply operations of the values of the sign bit fields and the exponent fields of the input data and filter weights) and may not include circuitry to perform the portion(s) of the multiply operation that is performed by multiplier circuit B (i.e., circuitry associated with the multiply operation of the values of fraction fields of, for example, the input data and filter weights). Similarly, the multiplier circuit B processes or performs a different portion(s) of the multiply operation (i.e., values of fraction fields of, for example, the input data and filter weights) and may not include circuitry to perform the portion(s) of the multiply operation that is performed by multiplier circuit A (i.e., multiply operations of the values of the sign bit fields and the exponent fields of, for example, the input data and filter weights).
With continued reference to
The multiply block/operation may be separated, divided and/or broken into two pieces: a 24×24 multiply core and everything else. The interconnection bus (e.g., a 48 conductor bus P[47:0]) provides communication between the pieces of the multiply block to facilitate communication of the product of the fraction field (via the integer type multiplier—multiplier circuit B) to the multiplier circuit A. In the exemplary embodiment, the 32×32 bit multiply core may also be a 24×24 bit multiply core employed for the fractional field multiplication (e.g., circuitry that is configured to perform two's complement multiplication). Where a 32×32 bit multiply core of the integer type multiplier circuit (multiplier circuit B) is employed, the lower or LSB 48 bits of the 64 bit product are routed to the interconnection bus (via the connection port P[47:0]).
In one embodiment, the multiply operation (FP32) implemented by the multiplier circuit array begins by loading the two 32 bit operands (from input bus 1 and input bus 2) into the multiplier circuit A and simultaneously loading the two operands into multiplier circuit B. Where the multiplier circuit B is a 32 bit integer multiplier type, a constant 9'h001 is input into the MSBs. The multiplier circuit B multiplies the two fractional fields of the operands and generates a product thereof. In addition, the multiplier circuit A processes the 1b sign and 8b exponent fields.
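The FP32 multiply operation described above may be approximated in software as a sketch. In the following Python example, `multiplier_b` models the integer type core (a nine bit constant of value one placed in the MSBs of each operand, with the lower 48 bits of the product returned as P[47:0]) and `multiplier_a` models the sign and exponent processing and the assembly of the output fields. The function names are hypothetical, the exponent bias of 127 and the normalization step follow the conventional FP32 encoding and are assumed here, and the sketch truncates rather than rounds and handles only normal, in-range operands.

```python
import struct

def to_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]

def from_bits(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]

def multiplier_b(op1, op2):
    """Integer core: constant one in the 9 MSBs, then a 32x32 multiply."""
    m1 = (0x001 << 23) | (op1 & 0x7FFFFF)   # hidden one plus 23 bit fraction
    m2 = (0x001 << 23) | (op2 & 0x7FFFFF)
    return (m1 * m2) & ((1 << 48) - 1)      # lower 48 bits, i.e. P[47:0]

def multiplier_a(op1, op2, p):
    """Sign and exponent processing, then assembly of the product fields."""
    sign = (op1 >> 31) ^ (op2 >> 31)
    exponent = ((op1 >> 23) & 0xFF) + ((op2 >> 23) & 0xFF) - 127  # assumed bias
    if p >> 47:                  # fraction product in [2, 4): normalize
        exponent += 1
        fraction = (p >> 24) & 0x7FFFFF
    else:                        # fraction product in [1, 2)
        fraction = (p >> 23) & 0x7FFFFF
    return (sign << 31) | ((exponent & 0xFF) << 23) | fraction

def fp32_multiply(x, y):
    a, b = to_bits(x), to_bits(y)
    return from_bits(multiplier_a(a, b, multiplier_b(a, b)))

print(fp32_multiply(1.5, -2.0))  # prints -3.0
print(fp32_multiply(3.0, 3.0))   # prints 9.0
```

Because the two 24 bit values each lie between 2^23 and 2^24, their product always fits in 48 bits, which is consistent with routing only the lower 48 bits of a 64 bit product to the interconnection bus.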
With continued reference to
In addition to the multiply core to perform operations with respect to the fractional fields of the operands, other logic circuitry may be disposed in multiplier circuit B (versus in multiplier circuit A). For example, with reference to
While the embodiment illustrated in
With reference to
In one embodiment, the rounding circuitry may also be disposed in multiplier circuit B. That is, either/both multiplier circuits of the multiplier circuit array may include rounding circuitry to round the resultant product of the multiply operation to generate or provide a predetermined bit length, size or precision of the fraction field of the output data. For example, where the output data includes a floating point data format having a bit length, size or precision of 32 bits, the multiply operation of the two operands may generate more bits corresponding to the fraction field than are suitable or defined for 32 bit floating point data. Here, the rounding circuitry generates or provides rounding data which is employed to round the fraction field to an appropriate bit length, size or precision corresponding to the data format (e.g., in the context of a 32 bit floating point data format, a 23 bit fraction field). Thus, in one embodiment, the rounding circuitry generates data/information to round the resultant product of the fraction fields. In one embodiment, the rounding circuitry may be separated into a plurality of segments so that the resultant/product may be rounded to a fraction size of bits (e.g., 8, 16, 24) that corresponds to a result of multiplier circuitry and/or the size or width of the output data. (See,
Notably, the embodiment of
With reference to
As noted above, each multiplier circuit of the multiplier circuit array may be a complete and fully functional/capable multiplier circuit or may be a partial multiplier circuit including only certain or selected circuitry of a complete and fully functional/capable multiplier circuit (e.g., omission of: (i) circuitry to perform the multiply operation corresponding to sign fields and exponent fields of the operands or (ii) circuitry to perform the multiply operation corresponding to the fraction fields of the operands). For example, with reference to
With reference to
In one embodiment, at least one of the multiplier circuits of the multiply circuit array outputs data of the product resulting from the multiply operation (e.g., the output of the multiply operation of the values of fraction fields of, for example, the input data and filter weights) to another multiplier circuit of the multiply circuit array. Notably, in one embodiment, the conductors may also communicate control or control type data (e.g., rounding information/data, outputs from fraction detection logic to detect, for example, special values/operands such as ZRO (zero), NAN (not a number), EOVFL (exponent overflow), EUNFL (exponent underflow) and/or INF (infinity)).
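Detection of the special values/operands noted above may be illustrated, under stated assumptions, in software. The Python sketch below assumes the conventional FP32 encodings (an all-ones exponent field for infinity and not-a-number, all-zero exponent and fraction fields for zero) and an exponent bias of 127; the function names and the "NUM" and "OK" labels are hypothetical and are introduced only for the example.

```python
def classify_fp32(bits):
    """Classify one 32 bit operand from its exponent and fraction fields."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:
        return "INF" if fraction == 0 else "NAN"
    if exponent == 0 and fraction == 0:
        return "ZRO"
    return "NUM"                 # hypothetical label for any other operand

def exponent_flag(exp1, exp2, bias=127):
    """Flag an out-of-range exponent sum before the product is assembled."""
    result = exp1 + exp2 - bias
    if result >= 0xFF:
        return "EOVFL"           # exponent overflow
    if result <= 0:
        return "EUNFL"           # exponent underflow
    return "OK"                  # hypothetical label for an in-range exponent

print(classify_fp32(0x7F800000), classify_fp32(0x7FC00000))
print(exponent_flag(254, 254), exponent_flag(1, 1))
```

Flags of this kind are small compared with the product data, which is one reason they may share the same conductors between the multiplier circuits.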
With reference to
With continued reference to
The multiplier circuit array of
With reference to
The configuration of the multiplier circuit array may be user or system defined and/or may be one-time programmable/configurable (e.g., at manufacture) or more than one-time programmable/configurable (e.g., (i) at or via power-up, start-up or performance/completion of the initialization sequence/process sequence, and/or (ii) in situ (i.e., during operation of the integrated circuit), at manufacture, and/or at or during power-up, start-up, initialization, re-initialization, configuration, re-configuration or the like). In one embodiment, control circuitry is employed to program/configure the multiplier circuit array including the plurality of multiplier circuits. The control circuitry, in one embodiment, programs/configures the multiplier circuit array one-time; in another embodiment, the control circuitry programs/configures the multiplier circuit array more than one-time (i.e., multiple times). For example, the control circuitry may receive select and/or enable signals from internal or external circuitry (i.e., external to the one or more integrated circuits, for example, a host computer/processor) including one or more data storage circuits (e.g., one or more memory cells, register, flip-flop, latch, block/array of memory), one or more input pins/conductors, a look-up table LUT (of any kind), a processor or controller and/or discrete control logic. The control circuitry, in response thereto, may employ such signal(s) to enable or disable selected multiplier circuits of the multiplier circuit array and thereby configure the multiplier circuitry of, for example, the MAC or MACs of a data processing pipeline, to implement the multiply operations. The control circuitry may configure the multiplier circuitry in situ and/or at or during power-up, start-up, initialization, re-initialization, configuration, re-configuration or the like.
Indeed, in one embodiment, control circuitry may evaluate the input data and, based thereon, implement or select a configuration of the multiplier circuit array to provide the appropriate configuration to implement or provide a predetermined precision and data format of the resultant multiplication product (output data).
For example, with reference to
In another embodiment, where the precision and data format of the input data and filter weights have a 24 bit floating point data format, control circuitry may enable multiplier circuits A and B of the multiplier circuit array to implement the multiply operations of the multiplier circuitry. Here, multiplier circuit A may perform or implement the multiply operation in connection with the values of the sign bit fields and the exponent fields of, for example, the input data and filter weights, and multiplier circuit B may perform or implement the multiply operation in connection with the values of fraction fields of the input data and filter weights (in this example, a 16×16 multiply operation). Notably, multiplier circuit C (in this exemplary embodiment, an 8 bit integer type multiplier circuit) does not have the capacity to efficiently multiply the 15 bit values of each fraction field of the input data and filter weights. Thus, the control circuitry enables multiplier circuits A and B (and/or disables or deselects multiplier circuit C).
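The selection made by the control circuitry in this example may be sketched as a simple decision rule. The Python sketch below takes the widths of multiplier circuits B (16 bit) and C (8 bit) from the example above; the function name `select_configuration`, the allowance of one additional bit for the hidden one of the fraction, and the rule of choosing the narrowest sufficient integer type multiplier circuit are assumptions of the sketch and are not stated in the embodiments.

```python
INTEGER_MULTIPLIERS = {"B": 16, "C": 8}   # operand width in bits, per example

def select_configuration(fraction_bits):
    """Return the multiplier circuits to enable for a given fraction width."""
    needed = fraction_bits + 1            # assumed: fraction plus hidden one
    candidates = [(width, name)
                  for name, width in INTEGER_MULTIPLIERS.items()
                  if width >= needed]
    if not candidates:
        raise ValueError("no integer type multiplier circuit is wide enough")
    _, chosen = min(candidates)           # assumed: narrowest sufficient core
    # multiplier circuit A always handles the sign and exponent fields
    return ["A", chosen]

print(select_configuration(15))           # prints ['A', 'B']
```

For the 15 bit fraction fields of the 24 bit floating point example, the rule enables multiplier circuits A and B and leaves multiplier circuit C disabled, matching the outcome described above.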
As discussed above, the multiplier circuits of the multiplier circuit array may be programmed/configured via control circuitry, for example, in situ (i.e., during operation of the integrated circuit), at manufacture, and/or at or during power-up, start-up, initialization, re-initialization, configuration, re-configuration or the like. Notably, in one embodiment, interconnection bus (IB) selection circuitry (see,
As mentioned above, the multiplier circuit array of the present inventions may be incorporated and/or implemented in one or more (or all) multiplier-accumulator circuits of an execution or processing pipeline including execution circuitry employing one or more floating point data formats. In another aspect of the present inventions, the multiplier-accumulator circuit(s) may include a multiplier circuit array (which, in one embodiment, is configurable to provide a predetermined precision of the resultant multiplication product (output data)). The multiplier circuit array may include a floating point type multiplier and an integer type multiplier. The output of the multiplier circuit array, having a floating point data format, may be provided to the accumulator circuit, which is a floating point type accumulator. In one embodiment, the execution or processing pipeline includes a plurality of multiplier-accumulator circuits, each circuit including a multiplier circuit array (e.g., having an identical configuration). For example, the plurality of multiplier-accumulator circuits (each having multiplier circuit array) may be interconnected (in series) to perform the multiply and accumulate operations and/or the pipelining architecture or configuration implemented via connection of multiplier-accumulator circuits. In this pipeline architecture, for example, the plurality of multiplier-accumulator circuits may concatenate the multiply and accumulate operations of the data processing.
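The serial interconnection of multiplier-accumulator circuits may be pictured with the following minimal sketch, in which each circuit multiplies one input value by its stored filter weight and adds the product to the running total received from the preceding circuit. The class `Mac`, the function `run_pipeline`, and the use of Python floats in place of the floating point multiplier and accumulator circuits are assumptions for illustration only; timing, registers and the internal multiplier circuit array are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Mac:
    """Hypothetical model of one multiplier-accumulator circuit."""
    weight: float

    def step(self, data: float, acc_in: float) -> float:
        # Multiply the input by the stored weight, then accumulate.
        return acc_in + data * self.weight


def run_pipeline(macs, data):
    """Pass the accumulation value through the serially connected circuits."""
    acc = 0.0
    for mac, value in zip(macs, data):
        acc = mac.step(value, acc)
    return acc
```

For example, three circuits holding weights 1.0, 2.0 and 3.0 and receiving inputs 4.0, 5.0 and 6.0 produce the sum of products 4.0 + 10.0 + 18.0.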
The multiplier circuit array may include a plurality of multiplier circuits wherein one or more of the multiplier circuit(s) is/are incorporated or embedded into another of multiplier circuit of the multiplier circuit array. For example, with reference to
Notably, the multiplier circuit array of the present inventions may be employed and/or implemented in the multiplier-accumulator circuit, MAC pipelines, and other circuitry described and/or illustrated in U.S. patent application Ser. No. 16/545,345. Here, the multiplier circuit array of the present inventions may be incorporated into or employed in the multiplier circuitry of the multiplier-accumulator circuit described and/or illustrated in the '345 application to, for example, facilitate concatenating the multiply and accumulate operations, and reconfiguring the circuitry thereof and operations performed thereby (see, e.g., the exemplary embodiments illustrated in FIGS. 1A-1C of U.S. patent application Ser. No. 16/545,345); in this way, each multiplier-accumulator circuit includes a multiplier circuit array to, for example, process data (e.g., image data) in a manner whereby the processing and operations are performed as described herein. Notably, the '345 application is incorporated by reference herein in its entirety.
Further, the multiplier circuit array of the present inventions may also be employed or be implemented in the circuitry and techniques of multiplier-accumulator execution or processing pipelines (and methods of operating such circuitry) having circuitry to implement Winograd type processes to increase data throughput of the multiplier-accumulator circuit and processing—for example, as described and/or illustrated in U.S. patent application Ser. No. 16/796,111, which is hereby incorporated by reference in its entirety. In this regard, each multiplier-accumulator circuit described in the aforementioned '111 application, and pipeline(s) including such multiplier-accumulator circuit, may include a multiplier circuit array of the present inventions, to facilitate concurrently processing data to, for example, increase throughput of the data processing and overall pipeline.
In addition thereto, or in lieu thereof, the multiplier circuit array of the present inventions may also be employed and/or implemented in the circuitry of the multiplier-accumulator execution or processing pipelines (and methods of operating such circuitry) to process data, concurrently or in parallel, to increase throughput of the pipeline—for example, as described and/or illustrated in U.S. patent application Ser. No. 16/816,164; the '164 application is hereby incorporated by reference in its entirety. Here, a plurality of processing or execution pipelines, each pipeline having a plurality of multiplier-accumulator circuits that include a multiplier circuit array of the present inventions, may concurrently process data to, for example, increase throughput of the data processing and overall pipeline. Control or configure circuitry may be programmed to configure the multiplier-accumulator pipelines (wherein the individual multiplier-accumulator circuits include a multiplier circuit array of the present inventions) to implement the concurrent and/or parallel processing techniques.
The multiplier circuit array of the present inventions may also be employed and/or implemented in the multiplier-accumulator circuits employed in the processing pipelines or architectures, and circuitry to configure and control such pipelines/architectures, described and/or illustrated in U.S. patent application Ser. No. 17/019,212. In this regard, the multiplier circuitry of the multiplier-accumulator circuits may include the multiplier circuit array described and illustrated herein; the '212 application is incorporated by reference herein in its entirety.
Moreover, the present inventions may be implemented in the circuitry, function and operation of enhancing the dynamic range of the filter weights or coefficients as described and/or illustrated in U.S. patent application Ser. No. 17/074,670. That is, the present inventions may use the circuitry and techniques to enhance the dynamic range of the filter weights or coefficients of the '670 application. Such circuitry and techniques may be implemented in connection with the multiply operations performed by the multiplier circuit array of the multiplier-accumulator circuits of the present inventions. Notably, the '670 application is incorporated herein by reference in its entirety.
Further, the multiplier-accumulator circuits (having the multiplier circuit array of the present inventions described and/or illustrated herein) may be interconnected into execution or processing pipelines as described and/or illustrated in U.S. patent application Ser. No. 17/212,411, which, as noted above, is incorporated by reference herein in its entirety. In one embodiment, the circuitry configures and controls a plurality of separate multiplier-accumulator circuits (each having a multiplier circuit array of the present inventions) or rows/banks of such multiplier-accumulator circuits (which are interconnected, for example, in series (such rows/banks thereof are referred to, at times, as clusters)) to pipeline multiply and accumulate operations. In one embodiment, the plurality of multiplier-accumulator circuits (having the multiplier circuit array) may include a plurality of registers (including a plurality of shadow registers) wherein the circuitry also controls such registers to implement or facilitate the pipelining of the multiply and accumulate operations performed by the multiplier-accumulator circuits to increase throughput of the multiplier-accumulator execution or processing pipelines in connection with processing the related data (e.g., image data). (See, e.g., '345, '212 and '082 applications).
In another embodiment, the interconnection of the pipeline or pipelines (each including a plurality of multiplier-accumulator circuits having the multiplier circuit array of the present inventions described and/or illustrated herein) may be configurable or programmable to provide different forms of pipelining, as described and/or illustrated in U.S. patent application Ser. No. 17/212,411. Here, the pipelining architecture provided by the serial interconnection of the plurality of multiplier-accumulator circuits (having the multiplier circuit array of the present inventions described and/or illustrated herein) may be controllable or programmable. In this way, a plurality of multiplier-accumulator circuits, connected in series wherein each circuit has a multiplier circuit array of the present inventions described and/or illustrated herein, may be configured and/or re-configured to form or provide the desired processing pipeline(s) to process data (e.g., image data). For example, with reference to the '411 application, in one embodiment, control/configure circuitry may configure or determine which multiplier-accumulator circuits having a multiplier circuit array described herein, or rows/banks of interconnected multiplier-accumulator circuits having a multiplier circuit array described herein, are interconnected (in series) to perform the multiply and accumulate operations and/or the pipelining architecture or configuration implemented via connection of multiplier-accumulator circuits (or rows/banks of interconnected multiplier-accumulator circuits). Thus, in one embodiment, the control/configure circuitry configures or implements an architecture of the execution or processing pipeline by controlling or providing connection(s) between such multiplier-accumulator circuits and/or such rows of interconnected multiplier-accumulator circuits—each of which includes one or more multiplier circuit array embodiments described herein.
Moreover, the multiplier-accumulator circuits (having the multiplier circuit array of the present inventions described and/or illustrated herein) may be employed in the processing pipelines as described and/or illustrated in U.S. patent application Ser. Nos. 17/376,415 and 17/391,082; the '415 and '082 applications are incorporated by reference herein in their entirety. In short, the circuitry and techniques to implement the programmable granularity circuitry and techniques described and/or illustrated in the '415 application as well as the filter circuitry and techniques described and/or illustrated in the '082 application may be modified to employ the multiplier-accumulator circuits having one or more multiplier circuit array embodiments described and/or illustrated herein. Thus, in one embodiment, multiplier-accumulator circuits having one or more multiplier circuit array embodiments are implemented in the circuitry and techniques described and/or illustrated in the '810 and/or '979 applications.
There are many inventions described and illustrated herein. While certain embodiments, features, attributes and advantages of the inventions have been described and illustrated, it should be understood that many others, as well as different and/or similar embodiments, features, attributes and advantages of the present inventions, are apparent from the description and illustrations. As such, the embodiments, features, attributes and advantages of the inventions described and illustrated herein are not exhaustive and it should be understood that such other, similar, as well as different, embodiments, features, attributes and advantages of the present inventions are within the scope of the present inventions.
Indeed, the present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Moreover, each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof.
Moreover, although several of the exemplary embodiments and features of the inventions are described and/or illustrated in the context of certain data type and bit size or length of the core(s) of the multiplier circuit(s) (e.g., floating point format (FPxx) and/or integer format (INTxx)), the embodiments and inventions are applicable to other formats, precisions, sizes and/or lengths. For the sake of brevity, those other formats, precisions or lengths will not be illustrated separately but will be quite clear to one skilled in the art based on, for example, this application. The present inventions are not limited to (i) particular floating point format(s) and lengths thereof, particular fixed point format(s) and lengths thereof, operations (e.g., addition, subtraction, etc.), block/data width or length, data path width, bandwidths, values, processes and/or algorithms illustrated, nor (ii) the exemplary logical or physical overview configurations, exemplary module/circuitry configuration and/or exemplary Verilog code, nor the bit sizes of the cores of the multiplier circuits. The embodiments set forth herein are merely examples of the present inventions.
Further, in one embodiment, the execution pipelines, including MACs having the multiplier circuit arrays, may concurrently process data to increase throughput of the pipeline. For example, in one implementation, the present inventions may include a plurality of separate MACs and a plurality of registers (including, in one embodiment, a plurality of shadow registers) that facilitate pipelining of the multiply and accumulate operations wherein the circuitry of the execution pipelines concurrently processes data to increase throughput of the pipeline.
In certain embodiments, conversion circuitry may be employed to convert the data format to a suitable or a predetermined format (e.g., from FP8 to FP16; or from FP32 to FP24). For example, if the input data (e.g., image data) have been generated by an earlier filtering operation and/or stored in memory (e.g., SRAM such as L2 memory) after generation/acquisition, such data may be in a 24 bit floating point format (FP24—24 bits for sign, exponent, fraction). Under this circumstance, in one embodiment, the data/pixels may be converted (e.g., on-the-fly—i.e., immediately prior to such data processing) into an FP16 format, which may be the format employed by the multiplier circuitry in connection with the multiplication operation. Such circuitry may employ the data conversion circuitry described and/or illustrated in U.S. patent application Ser. No. 17/313,037 (see, e.g., FIG. 2A and associated text and related illustrations), and U.S. Provisional Application Nos. 63/173,948 (see, e.g., FIGS. 4A-11 and associated text) and 63/189,804 (see, e.g., FIGS. 4A-9 and associated text).
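One of the conversions named above, FP32 to FP24, may be sketched in software as follows. The sketch is not the conversion circuitry of the referenced applications. It assumes that FP32 is the IEEE 754 single-precision format and that FP24 keeps the same sign bit and 8 bit exponent while shortening the fraction from 23 bits to 15 bits, so that the conversion reduces to rounding away the low 8 bits. The round-to-nearest-even rule, the function names `fp32_to_fp24` and `fp24_to_fp32`, and the omission of infinity and NaN handling are assumptions for illustration.

```python
import struct


def fp32_to_fp24(value: float) -> int:
    """Round an FP32 value to an assumed FP24 word (1 sign, 8 exponent, 15 fraction).

    Infinities and NaN are not handled in this sketch.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    lsb = (bits >> 8) & 1          # lowest bit that survives the conversion
    rounded = bits + 0x7F + lsb    # round to nearest, ties to even
    return (rounded >> 8) & 0xFFFFFF


def fp24_to_fp32(word: int) -> float:
    """Widen an assumed FP24 word back to FP32 by appending 8 zero fraction bits."""
    return struct.unpack(">f", struct.pack(">I", (word & 0xFFFFFF) << 8))[0]
```

Because the exponent field is assumed to be unchanged, a carry out of the rounded fraction propagates into the exponent, which yields the next larger power of two as expected.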
Further, control circuitry to implement the configuration of the multiplier circuit array may be partially or entirely resident on the integrated circuit of the processing circuitry or external thereto (e.g., in a host computer or on a different integrated circuit from the MAC circuitry and execution pipelines). As noted above, the configuration of the multiplier circuit array of, for example, the MACs and/or MACs of the execution pipelines, may be user or system defined and/or may be one-time programmable (e.g., at manufacture) or more than one-time programmable (e.g., (i) at or via power-up, start-up or performance/completion of the initialization sequence/process sequence, and/or (ii) in situ (i.e., during operation of the integrated circuit), at manufacture, and/or at or during power-up, start-up, initialization, re-initialization, configuration, re-configuration or the like). In one embodiment, control circuitry may evaluate the input data and, based thereon, implement or select a configuration of the multiplier circuit array (e.g., based on the data format and/or precision of the input data). In response, the control circuitry may receive configuration instruction signals from internal or external circuitry (i.e., external to the one or more integrated circuits—for example, a host computer/processor) including one or more data storage elements (e.g., one or more memory cells, register, flip-flop, latch, block/array of memory), one or more input pins/conductors, a look-up table (LUT) of any kind, a processor or controller and/or discrete control logic. The control circuitry, in response thereto, may employ such signal(s) to implement the selected/defined configuration (e.g., in situ and/or at or during power-up, start-up, initialization, re-initialization, configuration, re-configuration or the like) of the multiplier circuit array.
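The selection of a configuration based on the data format of the input data may be sketched as a look-up of enable signals. The sketch below is illustrative only: the names `Enable`, `CONFIG_LUT` and `configure` are hypothetical, and only the FP24 entry (multiplier circuits A and B enabled, multiplier circuit C disabled) is taken from the description above. The FP16 entry is an assumption added to show a second configuration.

```python
from enum import Flag


class Enable(Flag):
    """Hypothetical enable signals, one bit per multiplier circuit."""
    A = 1  # sign/exponent multiplier circuit
    B = 2  # 16x16 fraction multiplier circuit
    C = 4  # 8 bit integer type multiplier circuit


# Hypothetical look-up table keyed by input data format.
# The FP24 row follows the description; the FP16 row is an assumption.
CONFIG_LUT = {
    "FP24": Enable.A | Enable.B,
    "FP16": Enable.A | Enable.C,
}


def configure(data_format: str) -> Enable:
    """Return the enable signals stored for the given data format."""
    try:
        return CONFIG_LUT[data_format]
    except KeyError:
        raise ValueError(f"no configuration stored for {data_format!r}") from None
```

A table of this kind could be written once (e.g., at manufacture) or rewritten in situ, which corresponds to the one-time and more than one-time programmable alternatives described above.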
Importantly, the present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Moreover, each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof.
Further, although the memory cells in certain embodiments are illustrated as static memory cells or storage elements, the present inventions may employ dynamic or static memory cells or storage elements. Indeed, as stated above, such memory cells may be latches, flip-flops or any other static/dynamic memory cell or memory cell circuit or storage element now known or later developed.
Notably, various circuits, circuitry and techniques disclosed herein may be described using computer aided design tools and expressed (or represented), as data and/or instructions embodied in various computer-readable media, in terms of their behavioral, register transfer, logic component, transistor, layout geometries, and/or other characteristics. Formats of files and other objects in which such circuit, circuitry, layout and routing expressions may be implemented include, but are not limited to, formats supporting behavioral languages such as C, Verilog, and HLDL, formats supporting register level description languages like RTL, and formats supporting geometry description languages such as GDSII, GDSIII, GDSIV, CIF, MEBES and any other formats and/or languages now known or later developed. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, non-volatile storage media in various forms (e.g., optical, magnetic or semiconductor storage media) and carrier waves that may be used to transfer such formatted data and/or instructions through wireless, optical, or wired signaling media or any combination thereof. Examples of transfers of such formatted data and/or instructions by carrier waves include, but are not limited to, transfers (uploads, downloads, e-mail, etc.) over the Internet and/or other computer networks via one or more data transfer protocols (e.g., HTTP, FTP, SMTP, etc.).
Indeed, when received within a computer system via one or more computer-readable media, such data and/or instruction-based expressions of the above described circuits may be processed by a processing entity (e.g., one or more processors) within the computer system in conjunction with execution of one or more other computer programs including, without limitation, net-list generation programs, place and route programs and the like, to generate a representation or image of a physical manifestation of such circuits. Such representation or image may thereafter be used in device fabrication, for example, by enabling generation of one or more masks that are used to form various components of the circuits in a device fabrication process.
Moreover, the various circuits, circuitry and techniques disclosed herein may be represented via simulations using computer aided design and/or testing tools. The simulation of the circuits, circuitry, layout and routing, and/or techniques implemented thereby, may be implemented by a computer system wherein characteristics and operations of such circuits, circuitry, layout and techniques implemented thereby, are imitated, replicated and/or predicted via a computer system. The present inventions are also directed to such simulations of the inventive circuits, circuitry and/or techniques implemented thereby, and, as such, are intended to fall within the scope of the present inventions. The computer-readable media corresponding to such simulations and/or testing tools are also intended to fall within the scope of the present inventions.
Notably, reference herein to “one embodiment” or “an embodiment” (or the like) means that a particular feature, structure, or characteristic described in connection with the embodiment may be included, employed and/or incorporated in one, some or all of the embodiments of the present inventions. The usages or appearances of the phrase “in one embodiment” or “in another embodiment” (or the like) in the specification are not referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of one or more other embodiments, nor limited to a single exclusive embodiment. The same applies to the term “implementation.” The present inventions are neither limited to any single aspect nor embodiment thereof, nor to any combinations and/or permutations of such aspects and/or embodiments. Moreover, each of the aspects of the present inventions, and/or embodiments thereof, may be employed alone or in combination with one or more of the other aspects of the present inventions and/or embodiments thereof. For the sake of brevity, certain permutations and combinations are not discussed and/or illustrated separately herein.
Further, an embodiment or implementation described herein as “exemplary” is not to be construed as ideal, preferred or advantageous, for example, over other embodiments or implementations; rather, it is intended to convey or indicate that the embodiment or embodiments are example embodiment(s).
Although the present inventions have been described in certain specific aspects, many additional modifications and variations would be apparent to those skilled in the art. It is therefore to be understood that the present inventions may be practiced otherwise than specifically described without departing from the scope and spirit of the present inventions. Thus, embodiments of the present inventions should be considered in all respects as illustrative/exemplary and not restrictive.
The terms “comprises,” “comprising,” “includes,” “including,” “have,” and “having” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, circuit, article, or apparatus that comprises a list of parts or elements does not include only those parts or elements but may include other parts or elements not expressly listed or inherent to such process, method, article, or apparatus. Further, use of the terms “connect”, “connected”, “connecting” or “connection” herein should be broadly interpreted to include direct or indirect (e.g., via one or more conductors and/or intermediate devices/elements (active or passive) and/or via inductive or capacitive coupling) unless intended otherwise (e.g., use of the terms “directly connect” or “directly connected”).
The terms “a” and “an” herein do not denote a limitation of quantity, but rather denote the presence of at least one of the referenced item. Further, the terms “first,” “second,” and the like, herein do not denote any order, quantity, or importance, but rather are used to distinguish one element/circuit/feature from another.
In addition, the term “integrated circuit” means, among other things, any integrated circuit including, for example, a generic or non-specific integrated circuit, processor, controller, state machine, gate array, SoC, PGA and/or FPGA. The term “integrated circuit” also means, for example, a processor, controller, state machine and SoC—including an embedded FPGA.
Further, the term “circuitry” means, among other things, a circuit (whether integrated or otherwise), a group of such circuits, one or more processors, one or more state machines, one or more processors implementing software, one or more gate arrays, programmable gate arrays and/or field programmable gate arrays, or a combination of one or more circuits (whether integrated or otherwise), one or more state machines, one or more processors, one or more processors implementing software, one or more gate arrays, programmable gate arrays and/or field programmable gate arrays. The term “data” means, among other things, a current or voltage signal(s) (plural or singular) whether in an analog or a digital form, which may be a single bit (or the like) or multiple bits (or the like).
Notably, the term “MAC circuit” means a multiplier-accumulator circuit of the multiplier-accumulator circuitry of the multiplier-accumulator pipeline. For example, a multiplier-accumulator circuit is described and illustrated in the exemplary embodiment of FIGS. 1A-1C of U.S. patent application Ser. No. 16/545,345, and the text associated therewith. In the claims, the term “MAC circuit” means a multiplier-accumulator circuit, for example, like that described and illustrated in the exemplary embodiment of FIGS. 1A-1C, and the text associated therewith, of U.S. patent application Ser. No. 16/545,345. Notably, however, the term “MAC circuit” is not limited to the particular circuit, logical, block, functional and/or physical diagrams, block/data width, data path width, bandwidths, and processes illustrated and/or described in accordance with, for example, the exemplary embodiment of FIGS. 1A-1C of U.S. patent application Ser. No. 16/545,345.
Notably, the limitations of the claims are not written in means-plus-function format or step-plus-function format. It is applicant's intention that none of the limitations be interpreted pursuant to 35 USC § 112, ¶6 or § 112(f), unless such claim limitations expressly use the phrase “means for” or “step for” followed by a statement of function and is void of any specific structure.
Again, there are many inventions described and illustrated herein. While certain embodiments, features, attributes and advantages of the inventions have been described and illustrated, it should be understood that many others, as well as different and/or similar embodiments, features, attributes and advantages of the present inventions, are apparent from the description and illustrations.
This non-provisional application claims priority to and the benefit of U.S. Provisional Application No. 63/120,498, entitled “Multiplier Circuitry having Multiplier Circuit Array, MAC and MAC Pipeline including Same, and Methods of Configuring Same”, filed Dec. 2, 2020. The '498 provisional application is hereby incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63120498 | Dec 2020 | US