Examples of the present disclosure generally relate to one-dimensional convolution by binary segmentation on DSP slices.
Convolution is a mathematical operation on two functions that produces a third function expressing how the shape of one is modified by the other. The term convolution refers both to the result function and to the process of computing it. Convolution is defined as the integral of the product of the two functions after one is reflected about the y-axis and shifted. The choice of which function is reflected and shifted before the integral does not change the integral result. Some features of convolution are similar to those of cross-correlation.
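For continuous functions f and g, this definition may be written as:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau = \int_{-\infty}^{\infty} f(t - \tau)\, g(\tau)\, d\tau,$$

where the equality of the two integrals reflects that the choice of which function is reflected and shifted does not change the result.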
In a discrete one-dimensional convolution, a stream of values is convolved with a finite set of values (i.e., a kernel). One-dimensional discrete linear convolution (and cross-correlation) can be described as polynomial products between a typically short kernel vector, mapped to a polynomial b(r), and a typically long or infinite data input stream, mapped to a polynomial x(r).
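For illustration, the following minimal sketch (Python, with arbitrary example values not drawn from this disclosure) verifies that a discrete one-dimensional linear convolution equals the coefficient sequence of the polynomial product b(r)·x(r):

```python
# Minimal sketch: a discrete 1-D linear convolution equals the coefficient
# sequence of the product of the two polynomials b(r) and x(r).
import numpy as np

def poly_mul(b, x):
    """Multiply polynomials given as coefficient lists, lowest power first."""
    y = [0] * (len(b) + len(x) - 1)
    for i, bi in enumerate(b):
        for j, xj in enumerate(x):
            y[i + j] += bi * xj
    return y

b = [1, 2, 3]      # short kernel, mapped to b(r)
x = [4, 5, 6, 7]   # data segment, mapped to x(r)
assert poly_mul(b, x) == list(np.convolve(b, x))
```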
Applications of convolution include artificial intelligence/machine learning, or AI/ML (e.g., convolutional neural networks), image processing, radio signal processing, probability, statistics, acoustics, spectroscopy, geophysics, engineering, physics, differential equations, and other signal processing applications, some of which can leverage convolutional neural networks. Discrete convolution among integer values is particularly relevant for quantized convolutional neural networks, a class of convolutional neural networks in which inputs and weights are quantized to integer values.
Techniques for one-dimensional convolution among integers by binary segmentation on DSP slices are described. One example is an integrated circuit (IC) device that includes pre-processing circuitry having concatenation circuitry that concatenates first and second sets of binary integer values as respective first and second vectors, and an integer multiplier circuit that includes first and second inputs coupled to an output of the pre-processing circuitry.
Another example described herein is an IC device that includes a digital signal processor (DSP) having a plurality of multiplier circuits configured to compute a linear convolution of a kernel and a data stream over multiple cycles.
Another example described herein is an IC device that includes pre-processing circuitry that encodes elements of a kernel as kernel polynomials, and encodes segments of a data stream as data polynomials. The IC device further includes a plurality of multiplier circuits that multiply respective ones of the kernel polynomials and respective ones of the data polynomials to provide a polynomial product. The IC device further includes post-processing circuitry that slices outputs of the multiplier circuits based on a scaling factor to segregate terms of the polynomial products of the multiplier circuits, aggregates like-terms of the polynomial products of the multiplier circuits and terms of polynomial products of the multiplier circuits retained from one or more prior cycles, based on exponents of the scaling factor associated with the respective terms, outputs a set of like-terms when processing of the associated segments of the data stream is complete, and retains terms for which processing of the associated segments of the data stream is incomplete.
So that the manner in which the above recited features can be understood in detail, a more particular description, briefly summarized above, may be had by reference to example implementations, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical example implementations and are therefore not to be considered limiting of its scope.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements of one example may be beneficially incorporated in other examples.
Various features are described hereinafter with reference to the figures. It should be noted that the figures may or may not be drawn to scale and that the elements of similar structures or functions are represented by the same reference numerals throughout the figures. It should be noted that the figures are only intended to facilitate the description of the features. They are not intended as an exhaustive description of the features or as a limitation on the scope of the claims. In addition, an illustrated example need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular example is not necessarily limited to that example and can be practiced in any other examples even if not so illustrated, or if not so explicitly described.
Embodiments herein describe one-dimensional convolution by binary segmentation on DSP slices.
Embodiments herein further describe application of binary segmentation to map a 1-D integer convolution to integer multiplication.
In an embodiment, binary segmentation techniques and slicing techniques are applied to wide integer multiplier circuits to perform parallel computations of multiple smaller products that contribute to the dot products of a linear convolution or cross correlation. Techniques disclosed herein may be used to exploit tiling of a multiplier matrix in both spatial dimensions to achieve a denser operational utilization that grows quadratically with the width of the accumulation data path. Techniques disclosed herein may be adapted and mapped efficiently to DSP slices.
Embodiments herein address various technical challenges, including application of binary segmentation techniques to signed data types, composition of arbitrarily wide DSP compute slices with a constant bounded critical path, and exploitation of binary segmentation techniques in multidimensional convolutions. Embodiments herein further describe techniques for updating kernel weights within DSP slices, in parallel with ongoing computations, without use of on-chip buffer memory external to the DSP slices.
Embodiments herein may be useful to accelerate convolution processing.
Multiplier circuit 104 receives vectors 110 and 112 at respective inputs 114 and 116, multiplies vectors 110 and 112, and provides a resultant product 120 at an output 118. Vectors 110 and 112 may represent polynomials, and product 120 may represent a product of the polynomials (i.e., a convolution or cross-correlation).
In an embodiment, multiplier circuit 104 is designed to compute a product of two high-precision numbers (i.e., relatively large integer values, or wide integers), and DSP 100 is configured to use multiplier circuit 104 to compute a 1D convolution of lower-precision numbers (i.e., relatively small integer values). As an example, and without limitation, kernel elements 106 and/or binary integer values 108 may be as small as 2, 3, or 4 bits, and vector 110 and/or vector 112 may include as few as two concatenated binary integer values.
DSP 100 may represent, for example and without limitation, a DSP48 arithmetic logic unit (ALU) or a DSP58 ALU developed by Xilinx, Inc. The DSP48 ALU includes an 18-bit by 25-bit signed multiplier, which may serve as multiplier circuit 104. The DSP58 ALU includes a 27-bit by 24-bit multiplier, which may serve as multiplier circuit 104. Other DSP types may be used.
DSP 100 may be embedded within a fabric of a field-programmable gate array (FPGA).
At 302, concatenation circuitry 103 concatenates first and second sets of binary integer values 106 and 108 to provide respective vectors 110 and 112. In the example of
Further in
At 304, multiplier circuit 104 multiplies vectors 110 and 112 (i.e., (rb1+b0)×(rx1+x0)). In
Since vectors 110 and 112 represent polynomials, product 120 represents a convolution of vectors 110 and 112. In the example of
Method 300 may be performed or repeated for additional binary integer values 106 and/or additional binary integer values 108, such as described below.
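As an illustrative sketch of method 300 (Python, with hypothetical values and a lane width L chosen large enough to hold every dot-product term), the concatenate-multiply-slice sequence may be modeled as follows:

```python
# Hedged sketch of method 300: pack two small unsigned values per operand
# at radix r = 2**L, multiply once, then slice the three product terms.
L = 8                  # lane width; must hold the largest dot-product term
r = 1 << L

b0, b1 = 3, 5          # kernel elements (e.g., 4-bit values)
x0, x1 = 7, 2          # data values

B = b0 + b1 * r        # "concatenation" = radix-r packing for unsigned data
X = x0 + x1 * r
P = B * X              # one wide multiply: (r*b1 + b0) * (r*x1 + x0)

lane0 = P & (r - 1)               # b0*x0
lane1 = (P >> L) & (r - 1)        # b1*x0 + b0*x1
lane2 = (P >> (2 * L)) & (r - 1)  # b1*x1
assert [lane0, lane1, lane2] == [b0 * x0, b1 * x0 + b0 * x1, b1 * x1]
```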
Linear convolution (and cross-correlation) can be described as polynomial products between a relatively short kernel vector (i.e., weights b_i), mapped to a polynomial b(r), and a relatively long or even infinite data stream x_i, mapped to a polynomial x(r). Powers of the radix r encode the positions of the components represented by the corresponding coefficients:
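$$b(r) = \sum_i b_i\, r^i, \qquad x(r) = \sum_i x_i\, r^i.$$

The product y(r) = b(r)·x(r) then collects the convolution outputs as its coefficients: $y(r) = \sum_k \big(\sum_i b_i\, x_{k-i}\big)\, r^k$.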
After potentially reversing the order of the kernel components, both linear convolution and cross-correlation ultimately yield a computation with a long or infinite band structure.
For a sufficiently large choice of r = 2^L, the kernel b(r) and segments of the input x(r) can be mapped to wide integer inputs of a single wide integer multiplier circuit for computing multiple rows of the computation. For unsigned data, assembly of the input integers amounts to plain concatenation, such as illustrated in the examples above. The resultant product can be sliced or divided into columns, provided the lane width L is able to accommodate the maximum number of bits of the dot product result. Each step completes an output segment of the same length as the input segment it processes; partial results in higher lanes are continued in the next computation step. If a multiply-accumulate operation is available, the continuation may be accomplished by right-shifting the partial results and inserting them into the additive input in the next cycle. Conventional techniques for accumulating and tracking cross-contamination across different virtual lanes may be employed. DSP 100 may extend conventional overlap-and-add (OLA) techniques and/or conventional binary segmentation techniques to linearly convolve kernels of arbitrary sizes on streaming data.
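The continuation of partial results may be modeled as follows (a behavioral Python sketch under assumed parameters, not a cycle-accurate description of DSP 100):

```python
# Behavioral sketch of the overlap-and-add continuation: each step
# multiply-accumulates the packed kernel against a packed 2-element input
# segment; upper lanes carry into the next step via a right shift into the
# additive input.
import numpy as np

L = 12                    # lane width, large enough for every dot product
S = 2                     # input-segment length (elements per step)
b = [3, 5]                # 2-tap kernel
x = [7, 2, 4, 6, 1, 9]    # unsigned data stream

B = sum(bi << (L * i) for i, bi in enumerate(b))
acc, out = 0, []
for s in range(0, len(x), S):
    X = sum(xi << (L * i) for i, xi in enumerate(x[s:s + S]))
    acc += B * X                    # wide multiply-accumulate
    for _ in range(S):              # one lane completes per consumed element
        out.append(acc & ((1 << L) - 1))
        acc >>= L                   # continue partial results next step
while acc:                          # drain the tail once the stream ends
    out.append(acc & ((1 << L) - 1))
    acc >>= L
assert out == list(np.convolve(b, x))
```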
In the example of
In an embodiment, DSP 100 processes a long and potentially unbounded stream of data (x_i) with a bounded kernel (b_i), examples of which are provided below.
At 602, at the outset of a first cycle, C0, selector circuitry 406 selects kernel elements, or weights b0 and b1.
At 604, segmentation circuitry 408 segments binary integer values x0 and x1 from data stream x(i) 404.
At 606, concatenation circuitry 103 concatenates kernel elements b0 and b1 as vector 110, and concatenates segmented binary integer values x0 and x1 as vector 112, as illustrated in
At 608, multiplier circuit 104 multiplies vectors 110 and 112, such as described above with reference to 304 in
At 610, slicer circuitry 412 slices product 120 for cycle C0 based on scaling factor r. In
For cycle C0, 612 is not applicable.
At 614, accumulator circuitry 414 retains any portion of product 120 of cycle C0 that represents an incomplete portion of the linear convolution. In
At 616, post-processing circuitry 410 outputs completed terms of the linear convolution at an output 416. In
Processing returns to 602 for cycle C1, where selector circuitry 406 selects kernel elements b2 and b3.
At 604, segmentation circuitry 408 again provides binary integer values x0 and x1 to concatenation circuitry 103.
At 606, concatenation circuitry 103 concatenates kernel elements b2 and b3 as vector 110, and concatenates segmented binary integer values x0 and x1 as vector 112, as illustrated in
At 608, multiplier circuit 104 multiplies vectors 110 and 112.
At 610, slicer circuitry 412 slices product 120 for cycle C1 based on scaling factor r (i.e., LSBs r^2·b2x0, ISBs r^3·(b3x0+b2x1), and MSBs r^4·b3x1).
For cycle C1, 612 is not applicable.
At 614, accumulator circuitry 414 retains any portion of product 120 of cycle C1 that represents an incomplete portion of the linear convolution. In
Since there are no completed terms for the linear convolution during cycle C1, post-processing circuitry 410 does not output any completed terms for cycle C1 at 616.
Processing returns to 602 for cycle C2, where selector circuitry 406 again selects kernel weights b0 and b1.
At 604, segmentation circuitry 408 segments binary integer values x2 and x3 from data stream x(i) 404.
At 606, concatenation circuitry 103 concatenates weights b0 and b1 as vector 110, and concatenates segmented binary integer values x2 and x3 as vector 112, as illustrated in
At 608, multiplier circuit 104 multiplies vectors 110 and 112.
At 610, slicer circuitry 412 slices product 120 for cycle C2 (i.e., LSBs r^2·b0x2, ISBs r^3·(b1x2+b0x3), and MSBs r^4·b1x3) in
At 612, accumulator circuitry 414 combines any retained portion of the linear convolution with segmented product 120 for cycle C2, if the retained portion completes an element of the linear convolution. As illustrated
At 614, accumulator circuitry 414 retains any portion of product 120 of cycle C2 that represents an incomplete portion of the linear convolution. In
At 616, post-processing circuitry 410 outputs any completed terms of the linear convolution. In
Processing returns to 602 for cycle C3. Method 600 may be repeated as described above, indefinitely or until data stream x(i) 404 terminates.
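The complete cycle-by-cycle scheme of method 600 may be modeled behaviorally as follows (a Python sketch with hypothetical values; the lane-offset shift plays the role of selector circuitry 406 stepping through kernel pairs, and the accumulator models the terms retained by accumulator circuitry 414):

```python
# Behavioral sketch of method 600: a 4-tap kernel is applied as two packed
# pairs per input segment; completed lanes are output, incomplete terms are
# retained for the next segment.
import numpy as np

L, S = 12, 2
b = [3, 5, 2, 7]                        # kernel, processed as pairs per cycle
x = [4, 6, 1, 9, 8, 2]                  # data stream
pairs = [b[i:i + S] for i in range(0, len(b), S)]

acc, out = 0, []
for s in range(0, len(x), S):           # one segment per pass (x0x1, x2x3, ...)
    X = sum(xi << (L * i) for i, xi in enumerate(x[s:s + S]))
    for j, pair in enumerate(pairs):    # cycles C0, C1, ... per segment
        Bj = sum(bi << (L * i) for i, bi in enumerate(pair))
        acc += (Bj * X) << (L * S * j)  # align pair j to its lane offset
    for _ in range(S):                  # completed terms are output (616)
        out.append(acc & ((1 << L) - 1))
        acc >>= L                       # incomplete terms are retained (614)
while acc:                              # drain once the stream terminates
    out.append(acc & ((1 << L) - 1))
    acc >>= L
assert out == list(np.convolve(b, x))
```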
DSP 100 may include multiple parallel multiplier circuits, such as described below with reference to
In
Further to the examples above, for a sufficiently large choice of r = 2^L, the kernel b(r) and segments of the input x(r) can be mapped to wide integers for computing multiple rows of this computation with a single wide multiplier. For unsigned data, the assembly of the input integers amounts to plain concatenation, such as described in examples above. Also, the product is easily sliced into the computed column results as long as the lane width L is able to accommodate the maximum dot product result. Each step completes an output segment of the same length as the input segment it processes; the partial results in higher lanes are continued in the next computation step. If a multiply-accumulate operation is available, this may be accomplished by inserting the right-shifted partial results to the additive input in the next round.
For signed data, the wide integer composition is no longer a plain concatenation but turns into proper addition and subtraction with potential cross-lane borrows. In the context of deep neural networks, this rarely translates into effort that needs to be handled at inference time, for several reasons. One reason is that activations (inputs) predominantly remain unsigned values by being subjected to activation functions such as the rectified linear unit (ReLU). Another reason is that signed input values can be shifted to an unsigned range by appropriately re-adjusting the bias value that is added to the output of a convolution layer in a convolutional neural network. Another reason is that, typically, signed weights (kernel parameters) can be handled and packed arithmetically before a ML model is deployed for inference. Moreover, the wide integer output can be ensured to be sliceable when a multiply-accumulate operation underlies the compute and all lanes starting a new column of compute (rather than continuing one) are primed with an initial value of 2^(L−1) rather than 0. The implied offset in the ultimate lane result is easily corrected by inverting back the topmost bit. A consequence for the course of the computation is that no borrow from negative dot product accumulations in a column is able to spill over and pollute the next computation lane.
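The signed-data handling described above may be sketched as follows (Python, with hypothetical 8-bit lanes; Python's arbitrary-precision integers absorb the cross-lane borrows that a hardware composition must implement explicitly):

```python
# Hedged sketch: a signed kernel is packed by true radix-r addition, each
# lane is primed with 2**(L-1), and each result lane is recovered by
# inverting back its topmost bit and reinterpreting it as signed.
L = 8
r = 1 << L
offset = 1 << (L - 1)                   # priming value per lane

b = [3, -2]                             # signed kernel weights
x = [5, 7]                              # unsigned activations (e.g., post-ReLU)

B = sum(bi * r**i for i, bi in enumerate(b))   # addition, not concatenation
X = sum(xi * r**i for i, xi in enumerate(x))
P = B * X + offset * (1 + r + r * r)           # prime all three lanes

def lane(p, i):
    u = (p >> (L * i)) & (r - 1)
    u ^= offset                                # invert back the topmost bit
    return u - r if u >= offset else u         # reinterpret as signed L-bit

assert [lane(P, i) for i in range(3)] == [15, 11, -14]  # b0x0, b1x0+b0x1, b1x1
```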
In an embodiment, DSP slices are used to implement a computation step convolving an arbitrarily sized kernel with an input segment. Even when multiple DSP slices are used to unroll the complete kernel width, the critical path remains bounded by the equivalent of two combinatorially-chained DSP slices as the compute columns are separated logically and no column spans more than two neighboring DSP slices. An example is provided below with reference to
The number of compute lanes that fit an individual DSP slice depends on the lane width, which, in turn, is determined by the bit widths of the operands. An arrangement of five parallel 10-bit lanes may be appropriate for inputs of approximately 4 bits, with an intrinsic accumulation of three compute rows. The achieved operational density of up to 9 multiply-accumulate operations (i.e., 18 Ops) per DSP slice exceeds, by more than 2×, the 4 multiply-accumulate operations (8 Ops) attainable by guard-free single-vector packing with spill-over tracking and correction in the surrounding fabric.
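As a plausibility check on the 10-bit lane width (an illustrative bound, assuming unsigned 4-bit operands): each elementary product is at most $(2^4 - 1)^2 = 225$, so an intrinsic accumulation of three compute rows is bounded by $3 \times 225 = 675 < 2^{10} = 1024$, which fits within a 10-bit lane.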
Each DSP slice implements a multiply-accumulate operation on wide integers. Assuming the naming of ports as given in
Multiple lower-precision elements of the kernel and of an input stream segment are put onto these wide datapaths. Mathematically, the elements are combined via a polynomial over a radix 2^L that facilitates their spacing into separable lanes.
A kernel or an input stream segment may have more elements than the wide DSP datapath can accommodate. In that case, the computation is striped across multiple DSP slices. If the element count of either the kernel or the input stream segment is not a full multiple of the lanes that can be accommodated on the assigned DSP datapath, the input is padded up to size. This is illustrated by the second DSP slice in the striped example mapping of
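The padding step may be sketched as follows (Python; the kernel size and lane count are illustrative):

```python
# Minimal sketch of striping: the kernel is padded with zeros so that each
# DSP slice receives a full complement of lanes.
def stripe(kernel, lanes_per_slice):
    k = list(kernel) + [0] * (-len(kernel) % lanes_per_slice)
    return [k[i:i + lanes_per_slice] for i in range(0, len(k), lanes_per_slice)]

assert stripe([3, 5, 2, 7, 1], 3) == [[3, 5, 2], [7, 1, 0]]
```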
Except for the very first iteration and for a kernel size of 1, the performed multiply-accumulate operation picks up some partially computed sums from its preceding iteration. The corresponding state is communicated through a register, illustrated here as Accu. The notation Accu+ differentiates the state output of an iteration from its input.
Each iteration will start the dot product computation of new convolutional outputs. Their number is equivalent to the number of elements in the consumed input stream segment. The corresponding compute lanes are primed by a starting value. For an unsigned accumulation, this will typically be zero. For a signed accumulation, a starting value of 2^(L−1) is a more practical choice, as it avoids cross-lane borrows on the wide integer datapaths. This way, the output values remain separable without requiring a subsequent corrective computation. The priming value is referred to as o.
Note that the arrangement of the computation into columns is important in
In an embodiment, DSP 100 and/or DSP slices 900-1 and 900-2 update kernel parameters in the background, without accessing external circuitry. In some situations, the same kernel is applied to a long input sequence or a comparatively large feature map before being preempted for computation of another channel or layer. In an embodiment, DSP 100 and/or DSP slices 900-1 and 900-2 include registers (e.g., registers B1 and B2) on an input path, which may be used as true on-chip storage devices for background updating of kernel parameters.
For example, if two kernels alternate back and forth for a long stretch of an application, their two parameter sets can be shifted into the B1/B2 register cascade of all involved DSP slices arranged into a chain. This process may be relatively slow but extremely efficient with a single data entry point at the start of the chain. In operation, the currently active kernel is selected from B1 or B2 by a control (e.g., an INMODE control). If more kernels are rotated in the course of the application, the register cascade may still serve as a mediator between a lower-bandwidth on-chip or off-chip kernel parameter storage device and the parallel kernel application.
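The background update may be modeled behaviorally as follows (a Python sketch; the register names follow the text, while the chain depth and timing are illustrative assumptions):

```python
# Behavioral sketch of the register cascade: new parameters ripple serially
# through the B2 registers of chained slices while B1 feeds the active
# computation.
class SliceRegs:
    def __init__(self):
        self.B1 = 0        # currently active kernel parameter
        self.B2 = 0        # background bank being refilled

def shift_in(chain, value):
    """Push one parameter into the B2 cascade; contents ripple down-chain."""
    for regs in chain:
        regs.B2, value = value, regs.B2

chain = [SliceRegs() for _ in range(3)]
for w in [7, 2, 5]:                    # serially load a new kernel
    shift_in(chain, w)
assert [regs.B2 for regs in chain] == [5, 2, 7]
# An INMODE-style control would now switch each slice from B1 to B2.
```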
Techniques disclosed herein may be generalized to multi-dimensional convolutions. Some application domains, such as artificial intelligence (AI) domains, may utilize higher-dimensional convolutions. Even in these cases, advantages of binary segmentation in terms of operational density can be exploited. For example, the multi-dimensional computation may be sliced into parallel 1D-computations whose efficiently computed individual results are then aggregated across the other dimensions. Some advantages of techniques disclosed herein are strongly predicated on low-precision computations, which is a natural fit for AI domains.
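For example, slicing a two-dimensional convolution into one-dimensional computations may be modeled as follows (a Python sketch using numpy; the values are arbitrary):

```python
# Hedged sketch: a 2-D convolution sliced into 1-D convolutions, with
# results aggregated across the row dimension.
import numpy as np

def conv2d_via_1d(img, ker):
    H, W = img.shape
    h, w = ker.shape
    out = np.zeros((H + h - 1, W + w - 1), dtype=img.dtype)
    for i in range(h):            # aggregate across the second dimension
        for m in range(H):
            out[m + i] += np.convolve(img[m], ker[i])  # efficient 1-D core
    return out

rng = np.random.default_rng(0)
img = rng.integers(0, 16, (5, 6))
ker = rng.integers(0, 16, (3, 3))
ref = np.zeros((7, 8), dtype=img.dtype)     # direct dense 2-D reference
for i in range(3):
    for j in range(3):
        ref[i:i + 5, j:j + 6] += ker[i, j] * img
assert np.array_equal(conv2d_via_1d(img, ker), ref)
```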
Techniques disclosed herein may also be applied to digital filters.
Techniques disclosed herein may also be applied to other hardened, or fixed-circuitry, wide integer data paths, such as those found in arrays of AI engines (AIEs) and graphics processing units (GPUs).
Techniques disclosed herein may provide an operational-density advantage that grows quadratically with the width of the accumulation data path, relative to traditional one-sided vectorization.
IC device 1200 may further include a field-programmable gate array (FPGA) 1204. DSP 1202 may be provided within FPGA 1204 or outside of FPGA 1204.
FPGA 1204 may include one or more of a variety of types of configurable circuit blocks, such as described below with reference to
In the example of
One or more tiles may include a programmable interconnect element (INT) 1311 having connections to input and output terminals 1320 of a programmable logic element within the same tile and/or to one or more other tiles. A programmable INT 1311 may include connections to interconnect segments 1322 of another programmable INT 1311 in the same tile and/or another tile(s). A programmable INT 1311 may include connections to interconnect segments 1324 of general routing resources between logic blocks (not shown). The general routing resources may include routing channels between logic blocks (not shown) including tracks of interconnect segments (e.g., interconnect segments 1324) and switch blocks (not shown) for connecting interconnect segments. Interconnect segments of general routing resources (e.g., interconnect segments 1324) may span one or more logic blocks. Programmable INTs 1311, in combination with general routing resources, may represent a programmable interconnect structure.
A CLB 1302 may include a configurable logic element (CLE) 1312 that can be programmed to implement user logic. A CLB 1302 may also include a programmable INT 1311.
A BRAM 1303 may include a BRAM logic element (BRL) 1313 and one or more programmable INTs 1311. The number of interconnect elements included in a tile may depend on the height of the tile. A BRAM 1303 may, for example, have a height of five CLBs 1302. Other numbers (e.g., four) may also be used.
A DSP block 1306 may include a DSP logic element (DSPL) 1314 in addition to one or more programmable INTs 1311. An IOB 1304 may include, for example, two instances of an input/output logic element (IOL) 1315 in addition to one or more instances of a programmable INT 1311. An I/O pad connected to, for example, an I/O logic element 1315, is not necessarily confined to an area of the I/O logic element 1315.
In the example of
A logic block (e.g., programmable or fixed-function) may disrupt a columnar structure of configurable circuitry 1300. For example, processor 1310 spans several columns of CLBs 1302 and BRAMs 1303. Processor 1310 may range from, without limitation, a single microprocessor to a complete programmable processing system of microprocessor(s), memory controllers, and/or peripherals.
In
In the preceding, reference is made to embodiments presented in this disclosure. However, the scope of the present disclosure is not limited to specific described embodiments. Instead, any combination of the described features and elements, whether related to different embodiments or not, is contemplated to implement and practice contemplated embodiments. Furthermore, although embodiments disclosed herein may achieve advantages over other possible solutions or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the scope of the present disclosure. Thus, the preceding aspects, features, embodiments and advantages are merely illustrative and are not considered elements or limitations of the appended claims except where explicitly recited in a claim(s).
As will be appreciated by one skilled in the art, the embodiments disclosed herein may be embodied as a system, method or computer program product. Accordingly, aspects may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium is any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present disclosure are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments presented in this disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various examples of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
While the foregoing is directed to specific examples, other and further examples may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.