The present disclosure generally relates to devices, methods and apparatus for use in optical devices, and more particularly, for improving processing capabilities for computer vision applications.
In recent years, an increasing number of computer vision applications that historically relied on hand-crafted processing methods have adopted deep learning methods, such as Convolutional Neural Networks (“CNN”). As a result, the CNN processing power requirements for computer vision applications have naturally increased significantly. In some cases, an edge device, such as a robot, AR/VR glasses or a drone, would require tens to hundreds of tera multiply-accumulate (“MAC”) operations per second (i.e., 10^13 to 10^14 MAC/sec) for CNN-based computer vision applications, wherein a multiply-accumulate operation is a common step that computes the product of two numbers and adds that product to an accumulator. Moreover, implementations for vehicles are likely to require an even higher number of MAC operations per second, as these implementations typically execute multiple computer vision tasks concurrently.
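As an illustrative, non-limiting sketch, a multiply-accumulate operation and its chaining into a dot product may be expressed as:

```python
def mac(acc, a, b):
    """Multiply-accumulate: add the product of a and b to the accumulator acc."""
    return acc + a * b

# A vector dot product is simply a chain of MAC operations.
acc = 0
for a, b in zip([1, 2, 3], [4, 5, 6]):
    acc = mac(acc, a, b)
# acc == 1*4 + 2*5 + 3*6 == 32
```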
Two of the major challenges that edge devices typically face when supporting high-throughput CNN applications are power consumption and Synchronous Dynamic Random-Access Memory (“SDRAM”) bandwidth.
Three main classes of CNN processor architectures are known in the art:
a) Vector Dot-Product Pool (“VDPP”), exemplified in
By this implementation, convolutions are broken into multiple multiply-accumulate operations and can be represented by a sum of vector dot products (VDP). VDPP accelerators contain a pool of VDP units. A compiler manages the breaking of convolutions into VDP operations. Data and coefficients are fed into the VDP units via buffers, and adders at the output of the VDP units are used to aggregate multiple VDP operations into full convolution calculations. There are different variants of this architecture (including, for example, different VDP vector sizes, multiple output buffers that work with any single VDP unit, and the like).
b) Spatial Programmable Processing Array (“SPPA”), which is exemplified in
SPPA is composed of an array of simple programmable processing elements connected together via an on-chip network. Data may be forwarded between processing elements, or between processing elements and peripheral buffers.
The example illustrated in
c) Systolic Arrays (“SA”). The third class of CNN processor architecture known in the art is the SA, which is a homogeneous network of tightly coupled processing elements. Each PE independently computes a partial result as a function of the data received from its upstream neighbors, stores the result within itself and passes it downstream. The parallel input data flows through a network of hard-wired processor nodes, which combine, process, merge or sort the input data into a derived result.
Systolic Arrays are similar in some ways to SPPAs, as they are also composed of an array of processing elements with interconnections. There are, however, several differences between the two. First, the programmability of each processing element used when implementing an SA is very limited, unlike the PEs used when an SPPA is implemented. A PE in an SPPA will have a flexible ALU, a large register file which may be used for different purposes, and a rich set of instructions. A PE in a Systolic Array will have a few registers with dedicated purposes, and a very limited ALU, which can carry out only a few functions. Second, when implementing an SA, the connectivity between PEs is limited and hard-wired. Each PE is able to send data to, and receive data from, one or a few predefined neighbors. Third, the SA is designed for, and can support, a specific data flow (i.e., a specific way in which the high-level function, like convolution, is implemented). This is different from the SPPA implementation, as the latter can be configured to execute different data flows and perform different high-level tasks. Fourth, when an SA is implemented, the operation of each PE and the data transfers between PEs are synchronized for all elements included in the array. One may consider that concept as similar to different components comprised in a mechanical engine, each of which performs its task at a specific timing with respect to the other components. This is not necessarily the case for an SPPA.
When analyzing the power consumption of the different solutions, it is convenient to break the power consumption into that of each of the following consumers: arithmetic operations (mainly multiply-accumulate operations), access to local registers (i.e., registers close to the arithmetic unit), NOC transactions, access to global buffers and access to SDRAM. It has been shown that the power consumption of a single multiply-accumulate operation is similar to that of an access to a local register, and also similar to that of a NOC transaction. It has also been shown that a global buffer access consumes about ten times more than a multiply-accumulate operation, and that an SDRAM access consumes about one hundred times more than a multiply-accumulate operation. One can conclude that in order to reduce the power consumption, one would need to increase the data and coefficient reuse factor, where the latter is defined as the average number of multiply-accumulate operations per single memory access.
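As a rough, illustrative sketch (using the relative energy figures quoted above; the helper name and cost table are assumptions for illustration only), the effect of the reuse factor on the average energy per MAC can be estimated as:

```python
# Relative energy costs, normalized to one MAC operation, per the discussion above.
COST = {"mac": 1, "local_reg": 1, "noc": 1, "global_buffer": 10, "sdram": 100}

def energy_per_mac(reuse_factor, access_type):
    """Average energy per MAC: the arithmetic itself plus one memory access
    amortized over reuse_factor MAC operations."""
    return COST["mac"] + COST[access_type] / reuse_factor

# With a reuse factor of 1, a global-buffer-fed MAC costs ~11 units...
print(energy_per_mac(1, "global_buffer"))   # 11.0
# ...while a reuse factor of 10 amortizes the access down to ~2 units.
print(energy_per_mac(10, "global_buffer"))  # 2.0
```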
A processing engine that is engaged in calculating a convolution, may be characterized by its data flow (i.e., how data is handled by the processing engine). For example, a Weight-Stationary (“WS”) dataflow keeps filter weights (or coefficients) stationary, near the arithmetic unit, by enforcing the following mapping rule: all (or many) MAC operations that use the same filter weight are mapped to the same arithmetic unit for being serially processed thereat, a technique which maximizes the reuse of weights in the local registers. Another example: an Output-Stationary (“OS”) dataflow keeps partial sums stationary by accumulating them locally in local registers. The mapping rule is that all MAC operations that generate partial sums for the same output pixel are mapped on the same arithmetic unit.
VDPP architectures are usually operated while having a combination of data-stationary, weight-stationary and output-stationary data flows. However, in all these data flows there is at least one component (mainly data or coefficients) that has to change every cycle (i.e., reuse factor=1) and thus the power consumption is at least 10-20 times higher than the power invested in the arithmetic calculation itself.
SPPA architecture can support multiple dataflows, such as the matrix multiplication data flow used in the Google® Tensor Processing Unit (“TPU”) systolic array. In this case each programmable processing element will support the same functionality as a systolic processing element in a TPU.
Y. H. Chen et al., in their publication “Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators”, IEEE Micro, Vol. 37, Issue 3, 2017 (hereinafter “[1]”), showed how other data flows, such as Weight-Stationary, Output-Stationary or Non-Local-Reuse, can also be supported by an SPPA architecture. In [1], Y. H. Chen et al. also proposed a new data flow that is referred to as a Row-Stationary (“RS”) dataflow. They characterized the behavior of the RS dataflow with an SPPA platform they designed. In the RS dataflow, each processing element calculates a 1D convolution, with kernel size ‘k’, as follows:
wherein Cx are the kernel's coefficients; and
Dx are the data features.
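As an illustrative, non-limiting sketch (the equation itself appears in the referenced figure), the per-PE 1D convolution over the coefficients Cx and data features Dx may be expressed as:

```python
def pe_1d_conv(C, D, i):
    """One RS processing element: a 1D convolution with kernel size k = len(C),
    producing one output sample at position i (y_i = sum over x of C_x * D_{i+x})."""
    return sum(C[x] * D[i + x] for x in range(len(C)))

C = [1, 2, 3]            # kernel coefficients C_0..C_2 (k = 3)
D = [1, 0, 2, 1, 0, 1]   # data features
line = [pe_1d_conv(C, D, i) for i in range(len(D) - len(C) + 1)]
# line == [7, 7, 4, 4]  -> one output row of the 1D convolution
```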
When calculating a 2D convolution, one has to arrange k processing elements, one above the other, where each PE delivers its result to the PE located above it.
A 2D convolution in a CNN architecture involves accumulating multiple input channels as indicated in the equation below. Every output feature is calculated by accumulating multiple 2D convolutions as follows (where C represents the coefficients, D represents the data features and B represents the convolution bias):
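This accumulation over input channels may be sketched as follows (a hedged, illustrative Python fragment; the helper name and data layout are assumptions):

```python
def conv2d_multichannel(D, C, B, r, c):
    """One output feature at position (r, c): accumulate a k x k 2D convolution
    over all input channels, then add the convolution bias B."""
    n_ch = len(D)    # number of input channels
    k = len(C[0])    # kernel size
    acc = B
    for ch in range(n_ch):
        for i in range(k):
            for j in range(k):
                acc += C[ch][i][j] * D[ch][r + i][c + j]
    return acc

# Two input channels of 3x3 ones, all-ones 3x3 kernels, bias = 1:
D = [[[1] * 3 for _ in range(3)] for _ in range(2)]
C = [[[1] * 3 for _ in range(3)] for _ in range(2)]
# accumulates 2 channels x 9 taps, plus the bias -> 19
```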
By placing several groups of k processing elements one above the other, one may calculate an output pixel of a 2D CNN convolution.
Every column (in a matrix of N×K processing elements) is responsible for calculating one output feature line. By placing multiple PE columns side by side, one may calculate multiple output lines concurrently. In this structure, input data features move diagonally from one PE to another, coefficients move horizontally from one PE to another, and the partial sums (i.e., the results) move vertically from one PE to another.
Let us consider now the following example: in order to calculate three output lines of a 2D convolution with a kernel=3 and a stride=1, an array of 3×3 processing elements, as shown in
In the first calculation cycle, the bottom row of the marked block is convolved by the left bottom PE. The result is passed to the PE above. In the second calculation cycle the mid row of the marked block is convolved and added to the result obtained from the lower PE. The result is then passed to the PE above. In the third cycle, the top row of the marked block is convolved and added to the result from its preceding PE (which contains the sum of the two lower rows). The result thus obtained is the final convolution.
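The three-cycle flow described above may be sketched as follows (an illustrative Python fragment; the function names are assumptions, and the accumulation loop stands in for the partial sums passed upward between the stacked PEs):

```python
def row_conv(C_row, data_row, col):
    """One PE's task: 1D convolution of a single kernel row at position col."""
    return sum(C_row[x] * data_row[col + x] for x in range(len(C_row)))

def column_of_pes(kernel, data_rows, col):
    """k PEs stacked vertically: each convolves one row of the k x k window
    and adds its result to the partial sum received from the PE below it."""
    partial = 0  # the bottom PE starts from zero
    for pe_idx in range(len(kernel)):
        partial += row_conv(kernel[pe_idx], data_rows[pe_idx], col)
    return partial  # the top PE outputs the final 2D convolution result
```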
Now, let us consider a further example as depicted in
The RS dataflow was shown to consume less power than the WS, OS or NLR dataflows. However, the SPPA architecture does not exploit the full power-saving potential of the RS data flow, as its processing elements and NOC are too complex and flexible and thus are not power optimized.
However, the present invention is intended to provide a solution that consumes less power and less SDRAM bandwidth than the solutions known in the art.
The disclosure may be summarized by referring to the appended claims.
It is an object of the present disclosure to provide a method and apparatus that implement an innovative systolic array architecture for the computation of CNN functions.
It is another object of the present disclosure to provide a method and apparatus that implement a VLSI-based, high-throughput CNN engine that consumes less energy and SDRAM bandwidth than other solutions known in the art, wherein VLSI (Very-Large-Scale Integration) is the process of creating an integrated circuit by combining millions of MOS transistors onto a single chip.
It is still another object of the present disclosure to provide a method and apparatus for implementing an architecture that is suitable to achieve a reuse factor that is substantially larger than 1 for features, coefficients and partial sums. Such an architecture utilizes less power and less SDRAM bandwidth than VDPP architectures.
It is another object of the present disclosure to provide a method and apparatus for implementing an architecture that enables reducing system cost, size and heat generation.
It is still another object of the present disclosure to provide a method and apparatus for implementing an architecture that enables edge devices to support more complicated CNN applications (e.g., having higher processing power requirements).
Other objects of the present invention will become apparent from the following description.
According to a first embodiment of the disclosure, there is provided a computational module adapted to be used for carrying out a computer vision application and comprising at least one processing means (e.g., a processor), wherein the computational module is characterized in that it has a systolic array architecture and is configured to receive information conveyed from at least one image sensor and to apply Row-Stationary dataflow for calculating convolutions in a Convolutional Neural Network (“CNN”).
The term “computational module” as used herein throughout the specification and claims, is used to denote a number of distinct but interrelated units for carrying out a computational process. Such a computational module can be an Application-Specific Integrated Circuit (“ASIC”), a Field Programmable Gate Array (“FPGA”) or any other applicable processing device.
The term “image sensor” as used herein throughout the specification and claims, is used to denote a sensor that detects and conveys information used to make an image. Typically, it does so by converting the variable attenuation of light waves (as they pass through or reflect off objects) into signals. The waves can be light or another electromagnetic radiation. An image sensor may be used in robotic devices, AR/VR glasses, a drone, a digital camera, smart phones, medical imaging equipment, night vision equipment and the like.
According to another embodiment, there is provided a user device that comprises the computational module and at least one image sensor configured to convey information for generating an image.
In accordance with another embodiment, the at least one processing means comprises at least one macro array of processing elements, wherein each of the at least one macro array comprises a plurality of basic arrays of processing elements and wherein each of the plurality of basic arrays comprises a plurality of processing elements, each of the processing elements is a synchronous digital circuit, operated with a clock signal toggling at a pre-defined frequency, and the at least one macro array, the plurality of basic arrays and the plurality of processing elements are arranged in a layered hierarchical order.
According to another embodiment, each of the plurality of processing elements comprises at least one data register, at least one coefficient register, at least one control register, at least one multiplier and at least one adder, and each of the plurality of processing elements is configured to execute a part of the process for calculating convolutions in a CNN, depending on the location of each respective processing element within the corresponding basic array to which it belongs.
By yet another embodiment, the native convolutions supported by the architecture, comprise convolutions having a kernel=3 or a kernel=4, together with a stride=1 or a stride=2, and together with a dilation=1.
In accordance with still another embodiment, the native convolutions having a kernel=3 or a kernel=4, together with a stride=1 or a stride=2 and together with a dilation=1, are used to calculate convolutions with a kernel=3, 4, 5 or 7, together with a stride=1 or stride=2, and together with a dilation that is equal to 1, 2 or 3.
According to another embodiment, at least one from among the plurality of processing elements is adapted to output data belonging to more than one output channel, while reusing the same input data.
According to yet another embodiment the macro array is arranged in a plurality of rows and columns, each row comprises a plurality of basic arrays, and wherein all basic arrays included in a respective row of the macro array are fed with the same input feature stream via a single feature feed unit group, configured to forward features' related data for calculating CNN convolutions using the systolic array architecture.
In accordance with still another embodiment, the macro array is provided with a cache memory for storing features' related data and coefficients' related data thereat.
By yet another embodiment, all processing elements located along an edge of a respective basic array are connected to a single coefficient feed unit group, configured to forward coefficients' related data for all processing elements associated with said respective basic array.
According to another embodiment, the processing elements included in each of the plurality of basic arrays are arranged in a way that ensures a systolic operation of the user device.
In accordance with another embodiment, the processing elements are arranged in each of the plurality of basic arrays in a way that ensures connectivity between the processing elements that enables conveyance of data and control information in order to achieve the systolic operation of the user device.
By still another embodiment, results obtained from one neural network layer are stored back in the cache memory and used as inputs for one or more downstream layers, without having to approach an external memory (e.g., an SDRAM) to retrieve the required data therefrom.
According to another aspect of the disclosure there is provided a method for carrying out a computer vision application, wherein said method comprises (i) receiving information for generating the computer vision application's output or an intermediate computation result; (ii) providing a systolic array architecture for processing the information received; and (iii) calculating convolutions in a Convolutional Neural Network (“CNN”) by implementing a Row Stationary dataflow.
According to another embodiment of this aspect of the disclosure, the systolic array architecture provided comprises at least one macro array of processing elements, wherein each of the at least one macro array comprises a plurality of basic arrays of processing elements and wherein each of the plurality of basic arrays comprises a plurality of processing elements, each of the processing elements is a synchronous digital circuit, operated with a clock signal toggling at a pre-defined frequency, and the at least one macro array, the plurality of basic arrays and the plurality of processing elements are arranged in a layered hierarchical order.
In accordance with another embodiment, each of the plurality of processing elements comprises at least one data register, at least one coefficient register, at least one control register, at least one multiplier and at least one adder, and said method comprises a step of executing, by the plurality of processing elements, a part of the process for calculating convolutions in a CNN, depending on the location of each respective processing element within the corresponding basic array to which it belongs.
According to yet another embodiment of this aspect of the disclosure the method provided further comprises outputting by at least one of the plurality of processing elements, data belonging to more than one output channel, while reusing the same input data.
By still another embodiment of this aspect of the disclosure the method provided further comprises a step of providing the macro array with a cache memory for storing features' related data and coefficients' related data thereat.
For a more complete understanding of the present invention, reference is now made to the following detailed description taken in conjunction with the accompanying drawing wherein:
In this disclosure, the term “comprising” is intended to have an open-ended meaning so that when a first element is stated as comprising a second element, the first element may also include one or more other elements that are not necessarily identified or described herein, or recited in the claims.
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a better understanding of the present invention by way of examples. It should be apparent, however, that the present invention may be practiced without these specific details.
As discussed above, one of the major objects of the present invention is to provide a novel method and apparatus that implement an innovative systolic array architecture for the computation of CNN functions.
The proposed systolic array architecture of the present invention uses a Row-Stationary (RS) dataflow. This fact renders the proposed architecture not as flexible as an SPPA architecture (for example, the one used in the Eyeriss project). However, systolic architectures, including the one comprised in the present solution, are more power efficient and area efficient than SPPA architectures, as SPPA architectures sacrifice efficiency for flexibility. In addition, the systolic architecture proposed by the present invention is more scalable than the SPPA solution (i.e., it can be scaled to support much higher processing power). For example, the Eyeriss SPPA chip contains 168 PEs, while the architecture proposed by the present solution can contain tens of thousands of PEs in a typical implementation.
Google's TPU is a systolic array that employs a matrix multiplication data flow. As such, it is efficient in terms of power and area per multiply-accumulate operation. However, the proposed architecture of the present invention yields better computational efficiency (i.e., the average activity factor of MAC units), and has a more regular feature and coefficient feed pattern (i.e., it is simpler and more efficient to manipulate data when fetching it from the memory or while storing it in the memory). Thus, the present proposed solution is able to provide lower power and area per “functional” MAC, than can be achieved while implementing the TPU solution.
As discussed above, the present invention relies on innovative systolic array architecture for the computation of CNN functions. Processing arrays that employ Row-Stationary (RS) dataflow are known in the art (employing SPPA architecture). However, contrary to the prior art solutions, the present solution relies on principles that allow one to implement a RS compute engine in an efficient novel way in terms of silicon area, power consumption, functional efficiency (i.e., support of different CNN functions) and computational efficiency (i.e., the ratio between the number of MAC operations per second that are actually computed, and the number of MAC computational units in the array multiplied by the clock frequency).
In order to save power, the present disclosure seeks to implement PEs that are as simple as possible. However, in order to preserve sufficient functional flexibility, the present invention proposes to allow each PE to support the following functions natively: convolution with kernel=3 or 4, stride=1 or 2, and dilation=1. In the following description it will be shown how other kernel sizes and dilation values may be supported by using these basic PEs.
Let us begin by describing the architecture of the present solution that is based on processing elements that calculate results for two output channels. This means that the PE performs two independent convolutions (one after the other), using the same input features but with a different set of coefficients. Later on, the benefits of using this approach will be demonstrated and also it will be shown how to configure an architecture that relies on PEs that calculate one output channel, or more than two output channels.
A high-level structure of a PE according to an embodiment of the present invention is illustrated in
PE 500 is a synchronous digital circuit, operated with a clock signal toggling at frequency=f. The data input to the PE is forwarded via multiplexer (MUX) 502 to data registers 504 from which two flows are being forwarded. One flow exits the PE as a data out flow, whereas the second flow is forwarded via MUX 505 to multiplier 506.
The input coefficients' flow to the PE is forwarded to coefficient registers 508 from which two flows are being forwarded. One flow exits the PE as a weight out flow, whereas the second flow is forwarded via MUX 509 to multiplier 506.
In addition, a result-in flow is introduced to the PE and is forwarded to MUX 510 together with a flow of result out that leaves adder 512. The outcome of MUX 510 together with the outcome of multiplier 506 are forwarded to adder 512. The output of adder 512 is forwarded partially back to MUX 510 as described above, whereas a second flow of the outgoing results leaves the PE.
Finally, a control-in flow reaches the PE, sampled and stored at control registers 514, from which it will leave when appropriate the PE to be used by other PEs as control input.
PE 500 executes one MAC operation per clock cycle. When configured to work with k=3, PE 500 operates with macro cycles of 3 clocks, and when configured to work with k=4, PE 500 operates with macro cycles of 4 clocks.
PE 500 receives and samples new control information every feed cycle (i.e., every two macro cycles), and the control specifies for the PE how it should function in the upcoming calculation cycle. The control is sampled and stored at control register 514. The control register is routed out, to be used by other PEs as control input. If the control indicates that a calculation cycle is needed, new feature data (i.e., a feature) will accompany the control and will be sampled into the data registers 504. If the control indicates that new coefficients should be loaded, new coefficients will accompany the control and will be sampled into the coefficient registers 508.
The PE generates Result-Out outputs after an input-to-output delay from the control sampling time. First the result for channel 0 is delivered out and then, one macro cycle later, the result for channel 1 is delivered out. The PE also uses Result-In inputs for the calculation. Result-In signals are ready one macro cycle before the Result-Out signals are pushed out (first the Result-In for channel 0 is available and then the Result-In for channel 1 is available). Result-Out signals are used by neighboring PEs as Result-In signals. When the Result-Out from a certain PE is used as Result-In by another PE, it means that their feed cycles are shifted by one macro cycle.
Four data registers (designated as 504 in
When stride=2, every feed cycle, two data samples are pushed into D3 and D2, while the old D3 content is moved to D1 and old D2 content is moved into D0. During the calculation process, the multiplier is fed with D3 to D1 for kernel=3 and D3 to D0 for kernel=4.
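A hedged sketch of this register behavior follows (the stride=1 case, pushing a single new sample per feed cycle, is an assumption, as that part of the description refers to the figure; the class name is illustrative):

```python
class DataRegs:
    """Sketch of the four PE data registers, ordered D3 (newest) to D0 (oldest)."""

    def __init__(self):
        self.d = [0, 0, 0, 0]   # index 0 = D3, index 3 = D0

    def feed(self, samples, stride):
        if stride == 1:
            # assumption: one new sample shifts in, older samples shift down
            self.d = [samples[0]] + self.d[:3]
        elif stride == 2:
            # two new samples into D3 and D2; old D3 -> D1, old D2 -> D0
            self.d = [samples[0], samples[1], self.d[0], self.d[1]]

    def window(self, kernel):
        """The multiplier is fed D3..D1 for kernel=3, or D3..D0 for kernel=4."""
        return self.d[:kernel]
```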
It should be noted that several external data input sources can be selected for D3 and D2. This enables supporting different kernel/stride modes that require different inter-PE data connectivity (as illustrated in
The multiplier block 506 and adder block 512 in PE 500, calculate the convolution function:
The multiplier and adder in the PE have each at least one pipeline delay (i.e., their respective outputs are delayed by at least one clock cycle with respect to their input). The multiplexers 505 and 509 can also have a pipeline delay if needed. The number of pipeline delays is a dominant factor in the input-to-output delay of the PE.
The following table (Table 1) illustrates an example for the PE operation when kernel=3. Every column in the table represents one clock cycle. In this table, Result-In is designated as Sin, Result-Out as Sout and the multiplier's output as P. The PE's calculation results are shown in bold (the first bold result refers to channel 0 whereas the second to channel 1). In this example the multiplier (including the multiplexer 505 and 509) has a pipeline delay of three clock cycles, and the adder has a pipeline delay of one clock cycle. The input-to-output delay is 6 clock cycles.
The following table (Table 2) illustrates another example of a PE operation, this time with kernel=4 (with pipeline delays that are identical to the delays of the previous example).
Processing Elements (PEs) are grouped into Basic Arrays (BAs). Each BA is composed of V rows and H columns of processing elements. Preferably but not necessarily, V is divisible by both 3 and 4. This allows the BA to be configured to operate in both the kernel=3 and kernel=4 modes.
Furthermore, the Fig. depicts coefficients feed signals and Coefficient Feed Units (CFU).
Next, the Fig. illustrates partial sums signals and Post Processing Units (PPUs). These units add the result obtained from the BA to partial results that might have been prepared during earlier calculation cycles.
Finally, control signals are illustrated in
In this example, the underlying assumption regarding data flow in the BA is that each PE adds a delay of one macro cycle to the flows of control, coefficients and partial sum that are conveyed via that PE. So, for example, if PE-B receives its coefficient from PE-A, PE-B will start using the new coefficient set one macro cycle after PE-A.
When the execution of a new calculation task by the engine begins, it advances like a wave through the BA, namely, starting from the bottom left PE and advancing rightwards and upwards. The control signals also propagate like a wave within the array. The control input of every PE is connected to the control output of the PE below it or on its left. The PE in the bottom left corner gets its control from a control generation block.
The coefficients flow advances from left to right. The coefficient input of every PE is connected to the coefficient output of the PE on its left. The PEs on the left edge of the array are fed with coefficients from the Coefficient Feed Units (CFUs).
The Partial Sum flow advances upwards. The partial sum input of every PE is connected to the partial sum output of the PE located beneath it. The partial sum inputs of the PEs in the bottom row are connected to zero.
A Macro Array (MA) consists of several rows and columns of Basic Arrays (BAs), support logic around these BAs that is responsible for feeding data to, and harvesting data from, the processing elements, and control logic.
The example demonstrated in
Input features and coefficients are fed to the Basic Arrays (1502) from a Feature and Coefficient Cache Subsystem 1504 (also referred to hereinbelow as the FC-Cache). Output features, after going through the convolution and activation function block (1516), are stored back in the FC-Cache Subsystem 1504. The FC-Cache Subsystem 1504 is composed of memory blocks and a cross-connect routing fabric which supports fetching different combinations of feature and coefficient patterns, as needed for performing different CNN functions by the MA. The FC-Cache is connected to the outside world (e.g., to an SDRAM) via a regular SOC bus (e.g., AXI). New input features and coefficients can be brought from the SDRAM to the FC-Cache via this path, and new output features can be sent back to the SDRAM via this path. The FC-Cache has the following purposes:
The specific structure of the FC-Cache is not described in this example, and any suitable structure for such an FC-Cache, may be implemented by those skilled in the art without departing from the scope of the present invention.
All the BAs which are located in the same row of the MA are connected (in parallel) to the same FFU group (1506). This configuration is helpful in improving the features' reuse factor, as more output channels are calculated concurrently using the same input features. If, for example, the MA consists of 16 columns of BAs, then, since each BA calculates two output channels, each row in the MA calculates 32 output channels concurrently.
Each FFU group (1506) contains V+V/1.2×(H−1) FFUs, arranged in a way that can support all configurations as illustrated in
An example of an FFU is illustrated in
Each BA in the MA is connected to a different CFU group (as illustrated in
Each CFU comprises a FIFO with some support logic. Coefficients coming from the FC-Cache subsystem are pushed into the FIFO and, when needed, they are pulled out to feed the PEs. Push control signals accompany the respective coefficients that are introduced to the CFU, and Pull control signals indicate when it is time to pull out the coefficients. The timing of the pull control signals should preferably be perfectly synchronized with the processing timing of the PEs.
The calculation results are forwarded from the BAs to Post Processing Units. The purpose of the PPUs is to aggregate results coming from multiple calculation tasks and/or different BAs, and to add the bias. In operation, results coming out of the same BA at different calculation cycles may be aggregated, results coming from different BAs during the same calculation cycle may be aggregated, or any combination of these two types of aggregation may be applied. The PPUs operate closely with a Partial Sum cache memory (designated as P-Cache in
If there is no earlier partial sum in the P-Cache, the PPU adds the bias value instead.
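A minimal, illustrative sketch of this PPU behavior follows (the function name, cache layout and key are assumptions):

```python
def ppu_step(ba_result, p_cache, key, bias):
    """Post Processing Unit step: add the BA result to the earlier partial sum
    stored in the P-Cache under the given key, if one exists; otherwise start
    a new partial sum by adding the bias value instead."""
    if key in p_cache:
        p_cache[key] = p_cache[key] + ba_result
    else:
        p_cache[key] = ba_result + bias
    return p_cache[key]
```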
In addition to the calculation unit discussed above, each PPU comprises a FIFO for the partial sums coming into the PPU, a FIFO for the partial sums going out of the PPU, and a FIFO for the bias coefficients. The use of FIFOs helps to decouple the time at which the data is loaded from (or stored to) the memories from the time at which the data is used. In other words, the relevant data can be brought or sent asynchronously with respect to the actual calculation time. The bias coefficients can be brought from the FC-Cache or from another data source. Here too, the timing of the calculation control signals should preferably be perfectly synchronized with the processing timing of the PEs.
Each PPU group comprises at least H PPUs (one PPU associated with each PE column of a BA). It is possible to multiply the number of PPUs per group by the number of BA rows in the MA; in this case, BAs in different rows can operate independently. For example, if the MA comprises a 2×16 array of BAs (as depicted in
After a convolution result is ready in the Partial Sum Cache, it is conveyed via an Activation unit back to the FC-Cache.
In addition to the components described above, the MA comprises a control unit. The control unit is responsible for generating control signals for all components included in the described architecture, such that data flow and computations are synchronized.
A Processing Task is a processing procedure carried out by the MA on a data chunk, using a certain set of coefficients that are not changed throughout the task. Calculating a convolutional layer may take more than one Processing Task. In the following description, several modes of Processing Tasks are discussed.
In this example, the results obtained from two (or more) BA rows are aggregated in the PPUs. If for example, each BA has 12 rows and 4 columns of PEs, and the MA comprises 2 rows and 16 columns of BAs, then:
In this mode of operation, different BA rows operate in parallel on the same inputs, generating different outputs.
When implementing this mode of operation, different BA rows operate on different inputs and generate different outputs. This can be used for processing different output lines with the same input channels and output channels, or processing different input channels and output channels.
The convolution systolic array of the present disclosure inherently supports convolutions with kernel sizes of 3 and 4, strides of 1 and 2, and a dilation of 1. The following description explains how other convolution functions can be supported.
One may calculate a kernel=5 convolution (with stride=1 and dilation=1) by using kernel=3 convolvers, where a convolver is a unit that performs a convolution operation on a signal. The input feature maps are split into 4 groups (w, x, y, z) and the output feature maps into 4 groups (a, b, c, d), as illustrated in
First, the output features of one of the groups are calculated (for example, the features of group ‘a’). This is preferably carried out by summing the results of four kernel=3 convolutions, as illustrated in
Once the calculation of the first group's output pixels is concluded, the second group of output pixels (e.g., group ‘b’) is calculated in a similar way (as illustrated in the following
A stride=2 convolution can be calculated in a way similar to that demonstrated for stride=1; however, in this case only the pixels of group ‘a’ are calculated.
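The kernel=5 decomposition can be checked numerically. The NumPy sketch below (group names follow the w/x/y/z and ‘a’ labels above; the `conv2d` helper is an assumption, written here as a plain "valid" correlation) verifies that the group-‘a’ outputs of a 5×5, stride-1 convolution equal the sum of four sub-kernel convolutions, one per input group. In hardware the sub-kernels, which are at most 3×3, would be zero-padded to exactly 3×3 so that identical kernel=3 convolvers can compute them:

```python
import numpy as np

def conv2d(img, ker):
    # plain "valid" 2-D correlation (stride 1, dilation 1)
    kh, kw = ker.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * ker)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((11, 11))
ker = rng.standard_normal((5, 5))

ref = conv2d(img, ker)        # full kernel=5 result
a_ref = ref[0::2, 0::2]       # output group 'a' (even rows, even columns)

# the four input groups: even/odd rows crossed with even/odd columns
gw, gx = img[0::2, 0::2], img[0::2, 1::2]
gy, gz = img[1::2, 0::2], img[1::2, 1::2]

# group 'a' as a sum of four sub-kernel convolutions (sub-kernel
# sizes noted per line; each fits inside a kernel=3 convolver)
a_dec = (conv2d(gw, ker[0::2, 0::2]) +   # 3x3 taps hitting group w
         conv2d(gx, ker[0::2, 1::2]) +   # 3x2 taps hitting group x
         conv2d(gy, ker[1::2, 0::2]) +   # 2x3 taps hitting group y
         conv2d(gz, ker[1::2, 1::2]))    # 2x2 taps hitting group z

assert np.allclose(a_ref, a_dec)
```

The remaining output groups (‘b’, ‘c’, ‘d’) follow by shifting which input samples feed each sub-kernel, and the same slicing applied to a 7×7 kernel yields sub-kernels of at most 4×4, which is the basis of the kernel=7 case described below.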
For a kernel=7 convolution, a concept similar to the one described for the kernel=5 case may be used; however, in this case 4×4 convolvers are used. This type of operation is illustrated in the following
Convolution with Dilation=2
Dilated convolutions are a type of convolution that “inflates” the kernel by inserting holes between the kernel elements. An additional parameter, the dilation rate, indicates the extent by which the kernel is widened: a dilation rate of d inserts d−1 spaces between adjacent kernel elements.
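The relation between the number of kernel taps, the dilation rate, and the spatial extent the kernel covers can be stated compactly. The helper name below is illustrative:

```python
def effective_kernel(k, d):
    """Effective spatial extent of a k-tap kernel with dilation rate d:
    d - 1 holes are inserted between each pair of adjacent taps."""
    return k + (k - 1) * (d - 1)

assert effective_kernel(3, 1) == 3   # ordinary (dilation=1) convolution
assert effective_kernel(3, 2) == 5   # 3 taps span 5 input pixels
assert effective_kernel(3, 3) == 7
assert effective_kernel(4, 2) == 7
```

Thus a kernel=3, dilation=2 convolution reaches as far as an ordinary kernel=5 convolution while performing only 3×3 multiplications per output pixel.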
In the case of Convolution with Dilation=2, again, the concept of four input pixel groups and four output pixel groups is used (as shown in
For a kernel=3 convolution with Dilation=2 (
For a kernel=4 convolution with Dilation=2, the following is carried out: for calculating the first group's output pixels, a kernel=4 convolver is used with the first group's input pixels; for calculating the second group's output pixels, a kernel=4 convolver is used with the second group's input pixels, and so on.
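The claim that, for dilation=2, each output group is obtained by applying the un-dilated kernel to the same-numbered input group can be verified numerically. The NumPy sketch below (the `conv2d` helper with its optional dilation argument is an assumption, not the disclosed hardware) checks this for the first group:

```python
import numpy as np

def conv2d(img, ker, dil=1):
    # "valid" 2-D correlation with optional dilation
    kh, kw = ker.shape
    span_h, span_w = (kh - 1) * dil + 1, (kw - 1) * dil + 1
    oh, ow = img.shape[0] - span_h + 1, img.shape[1] - span_w + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+span_h:dil, j:j+span_w:dil] * ker)
    return out

rng = np.random.default_rng(1)
img = rng.standard_normal((12, 12))
ker = rng.standard_normal((3, 3))

dilated = conv2d(img, ker, dil=2)   # kernel=3 convolution with dilation=2

# the first output group (even rows/cols) equals an ordinary
# dilation=1 convolution of the first input group with the same kernel
grp_out = dilated[0::2, 0::2]
grp_in = img[0::2, 0::2]
assert np.allclose(grp_out, conv2d(grp_in, ker))
```

The same identity, with nine groups selected by stride-3 slicing, underlies the dilation=3 case described next.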
Convolution with Dilation=3
In this case the input and output pixels are split into 9 groups (‘A’ to ‘I’ and ‘a’ to ‘i’) as illustrated in
In the preceding description, the architecture referred to was based on PEs that support two output channels. It should be noted that one can implement a similar architecture by using a single output channel per PE. However, in this case the following differences will apply:
One may also implement a similar architecture with more than two output channels per PE. However, in this case the following differences will apply:
The following summarizes some of the main advantages of the present disclosure:
In the description and claims of the present application, each of the verbs “comprise,” “include” and “have,” and conjugates thereof, is used to indicate that the object or objects of the verb are not necessarily a complete listing of members, components, elements or parts of the subject or subjects of the verb.
The present invention has been described using detailed descriptions of embodiments thereof that are provided by way of example and are not intended to limit the scope of the invention in any way. The described embodiments comprise different features, not all of which are required in all embodiments of the invention. Some embodiments of the present invention utilize only some of the features or possible combinations of the features. Variations of embodiments of the present invention that are described and embodiments of the present invention comprising different combinations of features noted in the described embodiments will occur to persons of the art. The scope of the invention is limited only by the following claims.