This application claims foreign priority under 35 USC 119 from United Kingdom Patent Application No. 2216151.7 filed on 31 Oct. 2022, which is incorporated by reference herein in its entirety.
This application is directed to methods and systems for performing a per channel affine transformation on an input tensor using a neural network accelerator.
A Deep Neural Network (DNN) is a form of artificial neural network comprising a plurality of interconnected layers that can be used for machine learning applications. In particular, a DNN can be used in signal processing applications, including, but not limited to, image processing and computer vision applications.
The data input to and output from a layer of a DNN can be described as a tensor. As is known to those of skill in the art, a tensor is a generalization of vectors and matrices and can be considered as an n-dimensional array. A vector is a one-dimensional tensor, and a matrix is a two-dimensional tensor. The tensors in a DNN are often, but are not necessarily, four-dimensional. Reference is made to
The processing that is performed on the input data to a layer depends on the type of layer. For example, each layer of a DNN may be one of a plurality of different types. Example DNN layer types include, but are not limited to, a convolution layer, an activation layer, a normalisation layer, a pooling layer, a fully connected layer, and a batch normalisation layer. It will be evident to a person of skill in the art that these are example DNN layer types and that this is not an exhaustive list and there may be other DNN layer types.
A convolution layer convolves the input data with weights associated with the layer. Specifically, each convolution layer is associated with a plurality of weights k1 . . . kg, which may also be referred to as filter weights or coefficients. The weights are grouped to form one or more filters or kernels, and each filter may be associated with an offset bias, bias. Each filter may have a dimension KW×KH×Cin (i.e., each filter may comprise a set of KW×KH×Cin weights k) and may be applied to the input data according to a convolution operation across steps sW and sH in the W and H directions as shown in
An activation layer, which often, but not necessarily, follows a convolution layer, applies one or more activation functions to the input data to the layer. An activation function receives an input tensor and performs a certain non-linear mathematical operation on each value or element in the input tensor. In other words, the activation function operates on each value or element in the input tensor separately. In some examples, an activation layer may act as a rectified linear unit (ReLU) by implementing a ReLU function or as a leaky rectified linear unit (LReLU) by implementing an LReLU function.
A normalisation layer is configured to perform a normalising function, such as a Local Response Normalisation (LRN) function on the input data.
A pooling layer performs a pooling function, such as a max, min or average function, to summarise subsets of the input data. The purpose of a pooling layer is thus to reduce the spatial size of the representation to reduce the number of parameters and computation in the network, and hence to also control overfitting.
A fully connected layer, which often, but not necessarily, follows a plurality of convolution and pooling layers, takes a two-dimensional tensor (e.g. a tensor with a batch size dimension and a channel dimension) of input data values and outputs a two-dimensional tensor (e.g. a tensor with a batch size dimension and a channel dimension). Where the DNN is used for classification, the output may have A channels where A is the number of classes, and each value in the tensor may represent the probability of a certain class. The output tensor is generated through a matrix multiplication of the input data values with a set of weights, optionally followed by a bias offset. A fully connected layer thus receives a set of weights and may receive a bias.
A batch normalisation (often referred to as “batch norm”) layer, which often, but not necessarily, follows a convolution layer, applies a per channel affine transformation to an input tensor. Batch normalisation layers may be added to a neural network to make training of the neural network faster and more stable by normalisation of a subsequent layer's inputs by re-centring and re-scaling.
Accordingly, each layer of a DNN receives input data values (e.g., an input tensor) and generates output data values (e.g., an output tensor); and some layers (such as, but not limited to, convolution layers and fully connected layers) also receive weights and/or biases. The input data values, output data values, weights and biases of the layers of a DNN may collectively be referred to as the network parameters of the DNN.
Performing forward and backward passes of a DNN are often expensive to implement in terms of computation, bandwidth and power. Accordingly, neural network accelerators (NNAs) have been developed that allow DNNs to be implemented in an efficient manner (e.g., in a manner that requires less silicon area or less processing power).
For an NNA to implement a neural network the parameters of the neural network are represented in a number format. Two common types of number formats are fixed point number formats and floating point number formats. As is known to those skilled in the art, a fixed point number format has a fixed number of digits after the radix point (e.g., decimal point or binary point). In contrast, a floating point number format does not have a fixed radix point (i.e., it can “float”). In other words, the radix point can be placed in multiple places within the number. There are also other formats that are similar to fixed point number formats but where the exponent is a fixed real number (stored in floating point format) rather than an integer. For these formats 2^exponent is commonly referred to as the “scale”. Such formats may also possess a “zero point” which acts as an offset to the stored values to allow for asymmetric number ranges. The stored value xint corresponds to the value x = scale*(xint − zero point). While representing the network parameters of a DNN in a floating point number format may allow more accurate or precise output data to be produced, processing network parameters in a floating point number format is complex, which tends to increase the silicon area, power consumption, memory and bandwidth consumption, and complexity of the hardware to implement the neural network. Accordingly, an NNA may be configured to represent at least some of the network parameters in another format, such as a fixed point number format, to reduce the hardware area, power consumption, memory and bandwidth consumption and complexity of the hardware to implement the neural network.
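By way of illustration only, the following Python sketch shows the quantise/de-quantise relationship described above (x = scale*(xint − zero point)); the helper names and the 8-bit range are assumptions made for the example rather than anything mandated by the formats discussed.

```python
import numpy as np

def quantise(x, scale, zero_point=0, num_bits=8, signed=False):
    """Map a real value to a stored integer using a scale and zero point."""
    if signed:
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    else:
        qmin, qmax = 0, 2 ** num_bits - 1
    x_int = np.round(x / scale) + zero_point
    return int(np.clip(x_int, qmin, qmax))

def dequantise(x_int, scale, zero_point=0):
    """Recover an approximation of the real value: x = scale * (x_int - zero_point)."""
    return scale * (x_int - zero_point)

# Example: an asymmetric 8-bit format with scale 0.05 and zero point 30.
x_int = quantise(1.234, scale=0.05, zero_point=30)        # stored integer
x_hat = dequantise(x_int, scale=0.05, zero_point=30)      # ~1.25, i.e. 1.234 plus quantisation error
```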
Generally, the fewer the bits that are used to represent the network parameters of a DNN (e.g., input data values, weights, biases, and output data values), the more efficiently the DNN can be implemented in hardware. However, typically the fewer the bits that are used to represent the network parameters of a DNN (e.g., input data values, weights, biases, and output data values) the less accurate the DNN becomes. Accordingly, it is desirable to implement a DNN using a reduced number of bits without compromising the accuracy of the DNN.
The embodiments described below are provided by way of example only and are not limiting of implementations which solve any or all of the disadvantages of methods and systems for implementing all or a part of a neural network on a neural network accelerator.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Described herein are methods and neural network accelerators for implementing a per channel quantised affine transformation on an input tensor to generate an output tensor, the affine transformation for a channel comprising a multiplication by a multiplication parameter for the channel followed by an addition of an addition parameter for the channel. The neural network accelerator comprises a depth-wise convolution processing unit configured to accelerate depth-wise convolution operations, and a high-precision re-quantisation processing unit configured to accelerate re-quantisation or quantisation operations. The methods include implementing each affine transformation as an addition followed by a multiplication by: implementing the additions by performing, using the depth-wise convolution processing unit, a 1×1 depth-wise convolution on the input tensor based on a weight and a bias for each channel of the input tensor, wherein the weight for a channel is set to an addition scale factor for the channel and the bias for a channel is set to an integer addition value for the channel; and implementing the multiplications by scaling, using the high-precision re-quantisation processing unit, each value output from the depth-wise convolution processing unit, by a high-precision multiplication value for the corresponding channel.
A first aspect provides a method of implementing, on a neural network accelerator, a per channel quantised affine transformation on an input tensor to generate an output tensor, the affine transformation for a channel comprising a multiplication by a multiplication parameter for the channel followed by an addition of an addition parameter for the channel, the method comprising: implementing each affine transformation as an addition followed by a multiplication by: implementing the additions by performing, using a depth-wise convolution processing unit of the neural network accelerator, a 1×1 depth-wise convolution on the input tensor based on a weight and a bias for each channel of the input tensor, wherein the weight for a channel is set to an addition scale factor for the channel and the bias for a channel is set to an integer addition value for the channel; and implementing the multiplications by scaling, using a high-precision re-quantisation processing unit of the neural network accelerator, each value output from the depth-wise convolution processing unit, by a high-precision multiplication value for the corresponding channel.
The input tensor may comprise a plurality of input tensels, each input tensel being in a quantised number format represented by an integer value and a common input scale. The output tensor may comprise a plurality of output tensels, each output tensel being in a quantised number format represented by an integer value and a common output scale. The 1×1 depth-wise convolution may be performed on the integer values of the input tensor. The high-precision re-quantisation processing unit may output the integer values of the output tensor.
For each channel of the input tensor that has a non-zero multiplication parameter: the addition value for the channel may be based on the addition parameter for the channel, the addition scale factor for the channel, the multiplication parameter for the channel, and the common input scale, and the high-precision multiplication value for the channel may be based on the multiplication parameter for the channel, the addition scale factor for the channel, the common input scale and the common output scale.
For each channel of the input tensor that has a non-zero multiplication parameter, the integer addition value may represent a ratio of (i) a product of the addition scale factor for the channel and an addition variable for the channel, and (ii) a product of the multiplication parameter for the channel and the common input scale, wherein the addition variable for the channel is based on the addition parameter for the channel.
The addition variable for a channel may be equal to the addition parameter for that channel.
The addition variable for a channel may be a combination of the addition parameter for that channel and one or more other values.
Each input tensel may be in an asymmetric quantised number format based on a common input zero point, and the addition variable for a channel may be a difference between the addition parameter for the channel and a product of the multiplication parameter for that channel, the common input scale, and the common input zero point.
Each output tensel may be in an asymmetric quantised number format based on a common output zero point, and the addition variable for a channel may be a combination of the addition parameter for the channel and a product of the common output scale and the common output zero point.
Each input tensel may be in an asymmetric quantised number format based on a common input zero point and each output tensel is in an asymmetric quantised number format based on a common output zero point, and the addition variable for a channel is a combination of the addition parameter for the channel, the product of the common output scale and the common output zero point, and the product of the multiplication parameter for that channel, the common input scale, and the common input zero point.
For each channel of the input tensor that has a non-zero multiplication parameter, the high-precision multiplication value for the channel may represent a ratio of (i) the product of the multiplication parameter for the channel and the common input scale, and (ii) the product of the common output scale and the addition scale factor for that channel.
For each channel of the input tensor that has a non-zero multiplication parameter, the addition scale factor may be an integer greater than or equal to one.
Each channel of the input tensor that has a non-zero multiplication parameter may have the same addition scale factor.
The addition scale factor for each channel of the input tensor that has a non-zero multiplication parameter may be a largest weight accepted by the depth-wise convolution processing unit.
The addition scale factor for each channel of the input tensor that has a non-zero multiplication parameter may be a largest power of two weight accepted by the depth-wise convolution processing unit.
For each channel of the input tensor that has a non-zero multiplication factor, the addition scale factor may be equal to one.
The input tensor may comprise at least two channels that have a non-zero multiplication parameter, and two of the at least two channels may have a different addition scale factor.
The addition scale factor, for at least one channel of the input tensor that has a non-zero multiplication parameter, may be an addition scale factor that minimizes
wherein m is the addition scale factor for the channel, b is the addition parameter for the channel, a is the multiplication parameter for the channel and sx is the common input scale.
For each channel of the input tensor that has a multiplication parameter that is zero, or substantially zero: the addition scale factor for the channel may be equal to zero; the addition value for the channel may be an integer that represents a ratio of the addition parameter for the channel and the common output scale; and the high-precision multiplication value for the channel may be equal to one.
The high-precision re-quantisation processing unit may be configured to scale an input value by multiplying the input value by a high-precision multiplication input and shifting the output of the multiplication in accordance with a shift value, and the high-precision multiplication value for a channel is provided to the high-precision re-quantisation processing unit as a multiplication input and a shift value.
The multiplication parameter for a channel may be a floating point value and the addition parameter for a channel may be a floating point value.
The neural network accelerator may comprise a convolution processing unit configured to accelerate the processing of two-dimensional convolution operations and the convolution processing unit is separate and distinct from the depth-wise convolution processing unit.
The method may further comprise computing the addition value for each channel offline using a device external to the neural network accelerator prior to performing the 1×1 depth-wise convolution.
The depth-wise convolution processing unit may comprise hardware to remove a zero point from the input integer values. The method may further comprise, when the input tensels are in an asymmetric quantised number format based on a common input zero point, causing the depth-wise convolution processing unit to remove the common input zero point offset from the received input integers of channels of the input tensor with a non-zero multiplication parameter, prior to performing the 1×1 depth-wise convolution.
The high-precision re-quantisation processing unit may comprise hardware to add a zero point to scaled values. The method may further comprise, when the output tensels are in an asymmetric quantised number format based on a common output zero point, causing the high-precision re-quantisation processing unit to add the common output zero point to the scaled values corresponding to channels of the input tensor with a non-zero multiplication parameter.
The depth-wise convolution processing unit may be configured to receive biases up to a predetermined maximum number of bits.
A second aspect provides a neural network accelerator comprising: a depth-wise convolution processing unit configured to accelerate processing of a depth-wise convolution on a received input tensor; and a high-precision re-quantisation processing unit configured to accelerate processing of a per channel re-quantisation operation on a received input tensor; wherein the neural network accelerator is configured to perform a per channel quantised affine transformation on an input tensor to generate an output tensor, the affine transformation for a channel comprising a multiplication by a multiplication parameter for the channel followed by an addition of an addition parameter for that channel, by: implementing each affine transformation as an addition followed by a multiplication by: implementing the additions by performing, using the depth-wise convolution processing unit, a 1×1 depth-wise convolution on the input tensor based on a weight and a bias for each channel of the input tensor, wherein the weight for a channel is set to an addition scale factor for the channel and the bias for a channel is set to an integer addition value for the channel; and implementing the multiplications by scaling, using the high-precision re-quantisation processing unit, each value output from the depth-wise convolution processing unit, by a high-precision multiplication value for the corresponding channel.
A third aspect provides a neural network accelerator configured to perform the method of the first aspect.
A fourth aspect provides a computer readable storage medium having stored thereon computer readable code configured to cause a neural network accelerator to perform the method of the first aspect when the code is run.
A fifth aspect provides a method of processing input data in accordance with a neural network comprising at least one batch normalisation layer that implements a per channel affine transformation at a neural network accelerator by implementing the batch normalisation layer on the neural network accelerator in accordance with the first aspect.
The neural network accelerators described herein may be embodied in hardware on an integrated circuit. There may be provided a method of manufacturing, at an integrated circuit manufacturing system, a neural network accelerator described herein. There may be provided an integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, configures the system to manufacture an integrated circuit that embodies a neural network accelerator described herein. There may be provided a non-transitory computer readable storage medium having stored thereon a computer readable description of a neural network accelerator that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the neural network accelerator.
There may be provided an integrated circuit manufacturing system comprising: a non-transitory computer readable storage medium having stored thereon a computer readable description of a neural network accelerator described herein; a layout processing system configured to process the computer readable description so as to generate a circuit layout description of an integrated circuit embodying the neural network accelerator; and an integrated circuit generation system configured to manufacture an integrated circuit embodying the neural network accelerator according to the circuit layout description.
There may be provided computer program code for performing any of the methods described herein. There may be provided a non-transitory computer readable storage medium having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform any of the methods described herein.
The above features may be combined as appropriate, as would be apparent to a skilled person, and may be combined with any of the aspects of the examples described herein.
Examples will now be described in detail with reference to the accompanying drawings in which:
The accompanying drawings illustrate various examples. The skilled person will appreciate that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the drawings represent one example of the boundaries. It may be that in some examples, one element may be designed as multiple elements or that multiple elements may be designed as one element. Common reference numerals are used throughout the figures, where appropriate, to indicate similar features.
The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The present invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.
Embodiments will now be described by way of example only.
As described above, a DNN may comprise one or more batch normalisation layers each of which is implemented at inference time (e.g. during a forward pass of the neural network) as a per-channel affine transformation. In other words, for a batch normalisation layer there is an affine transformation per input channel that is applied to the corresponding, or respective, input channel. The affine transformation applied to an input channel may be represented by equation (1) where x is the input element or tensel, y is the output element or tensel, and a and b are floating point parameters. The first parameter a may be referred to herein as the multiplication parameter and the second parameter b may be referred to as the addition parameter. The affine transformations associated with different input channels may be the same or different. In other words, the parameters a and/or b may differ between input channels.
y=ax+b (1)
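Purely by way of illustration, the per-channel affine transformation of equation (1) can be sketched in Python as follows; the channels-first tensor layout and the function name are assumptions made for the example, not part of the described method.

```python
import numpy as np

def per_channel_affine(x, a, b):
    """Apply y = a*x + b independently to each channel.

    x: input tensor of shape (C, H, W) (channels-first, illustrative layout)
    a: multiplication parameter per channel, shape (C,)
    b: addition parameter per channel, shape (C,)
    """
    return a[:, None, None] * x + b[:, None, None]

# Example: a 3-channel input, as a batch normalisation layer would process at inference time.
x = np.random.randn(3, 4, 4).astype(np.float32)
a = np.array([1.1, 0.9, 1.5], dtype=np.float32)
b = np.array([0.0, -0.2, 0.3], dtype=np.float32)
y = per_channel_affine(x, a, b)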
As described above, DNNs are often expensive to implement in terms of computation, bandwidth and power. Accordingly, neural network accelerators (NNAs) have been developed that allow DNNs to be implemented in an efficient manner (e.g., in a manner that requires less silicon area or less processing power).
An NNA is hardware that is designed to accelerate the processing of a neural network. As is known to those of skill in the art, a hardware accelerator is hardware designed to perform a specific set of one or more functions more efficiently than a general processing unit, such as a central processing unit (CPU). Accordingly, in contrast to a general CPU which can be configured to perform any number of functions, an accelerator can only perform a limited set of one or more functions. NNAs have one or more network processing hardware units (which may simply be referred to as processing units) which are each designed to accelerate one or more neural network operations. A neural network operation is defined herein as an operation that is used to implement all or a part of a neural network layer. A neural network layer may be implemented by one or more neural network operations. Example neural network operations include, but are not limited to, convolution operations, non-linear operations, pooling operations and normalisation operations.
An NNA may therefore have, for example, a convolution processing unit which is configured to accelerate convolution operations, an activation processing unit which is configured to accelerate non-linear operations, a pooling processing unit which is configured to accelerate pooling operations, and/or a normalisation processing unit configured to accelerate normalisation operations. It will be evident to a person of skill in the art that this is just an example set of network processing hardware units that an NNA may have, and NNAs may have additional network processing hardware units, fewer network processing hardware units or a different combination of network processing hardware units.
As described above, while representing the network parameters of a neural network in a floating point number format may allow more accurate or precise output data to be produced by the neural network, processing network parameters in a floating point number format in hardware is complex which tends to increase the silicon area, power consumption, memory and bandwidth consumption, and complexity of an NNA compared to an NNA that processes network parameters in quantised formats, such as, but not limited to, fixed point number formats. Accordingly, an NNA may comprise hardware that is configured to receive and process network parameters in a quantised format, such as a fixed point number format, to reduce the area, power consumption, memory and bandwidth consumption, and complexity of the NNA.
Accordingly, the input and output of an affine transformation implemented on an NNA may be quantised integer values xq and yq. Ideally the input xq is de-quantised by a de-quantisation operation (“DQ”) to generate a floating point input x so that the operations are performed in floating point, and then the output y is quantised by a quantisation operation (“Q”) as shown in
x=sx*xq (2)
y=sy*yq (3)
where xq and yq are the integer values, sx is the input scale factor and sy is the output scale factor.
The affine transformation of equation (1) can be expressed in terms of the quantised values of equations (2) and (3) as shown in equation (4) and illustrated in
While the parameters (a, b) of the affine transformations may also be quantised, testing has shown that performing an affine transformation, and particularly an affine transformation that is part of a batch normalisation layer, with quantised parameters can significantly reduce the accuracy of the affine transformation which can reduce the accuracy of the neural network itself. The parameters (a, b) for an affine transformation that forms part of a batch normalisation layer are usually left unquantised during quantisation aware training (i.e. training in which the neural network is trained using quantised weights and input data values) because during training the parameters (a, b) change, potentially significantly, from batch to batch because of their dependence on the batch statistics, making it difficult to select appropriate quantisation parameters for them. Accordingly, ideally a quantised affine transformation would be performed in floating point (e.g. via floating point multiplication and addition) and the result quantised as shown in equation (4). However, for the reasons noted above, an NNA may not have hardware to perform the desired floating point operations in an efficient manner, certainly not for a stand-alone per channel quantised affine transformation. Specifically, quite often a batch normalisation layer follows a convolution layer and thus can be merged or fused with the convolution layer and performed efficiently using a convolution processing unit of an NNA. However, in some networks there is not a preceding convolution layer, or it would not be efficient (for one or more reasons) to merge a batch normalisation layer with a convolution layer. A per-channel affine transformation (which may or may not form part of a batch normalisation layer) that is not fused or merged with a convolution layer (or a convolution operation) is referred to herein as a stand-alone per-channel affine transformation. As described in more detail below, convolution processing units are generally complex hardware designed to implement 2D convolutions, thus it may not be efficient to perform a stand-alone per-channel affine transformation on a convolution processing unit.
Accordingly, the inventors have developed a method that allows a per-channel quantised affine transformation (and in particular a stand-alone per-channel quantised affine transformation) to be performed efficiently and accurately on a neural network accelerator with (i) a depth-wise convolution processing unit configured to accelerate processing of a quantised depth-wise convolution, and (ii) a re-quantisation processing unit configured to accelerate a high-precision per-channel re-quantisation operation. Specifically, in the methods described herein each quantised affine transformation is represented as an addition followed by a multiplication, and the additions are implemented by performing, using the depth-wise convolution processing unit, a 1×1 depth-wise convolution on the input tensor using appropriate weight and bias values, and the multiplications are implemented by scaling, using the high-precision re-quantisation processing unit, the outputs of the depth-wise convolution processing unit by an appropriate per channel multiplication value. The methods described herein take advantage of (i) the proficiency and efficiency of the depth-wise convolution processing unit and the re-quantisation processing unit at performing per-channel operations, and (ii) the high-precision addition and scaling that can be performed by the depth-wise convolution processing unit and the re-quantisation processing unit respectively.
As described in more detail below, testing has shown that implementing a per-channel quantised affine transformation in accordance with the methods described herein can achieve similar accuracy as performing the per channel quantised affine transformation in floating point (e.g. performing floating point multiplications and additions, e.g. in accordance with equation (4)). In other words, the methods described herein can be used to accurately emulate a floating point implementation of a per-channel quantised affine transformation without performing floating point operations.
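For reference, the following is a minimal sketch of the floating point behaviour that the described methods aim to emulate, combining equations (1) to (3) in the manner of equation (4); round-to-nearest is assumed and clamping to the output integer range is omitted, and the function name and tensor layout are illustrative.

```python
import numpy as np

def quantised_affine_reference(x_q, a, b, s_x, s_y):
    """Floating point reference for a per-channel quantised affine transformation.

    x_q:  integer input tensor of shape (C, H, W)
    a, b: per-channel floating point affine parameters, shape (C,)
    s_x:  common input scale, s_y: common output scale
    Returns output integers y_q such that y is approximately s_y * y_q.
    """
    x = s_x * x_q                                 # de-quantise the input (equation (2))
    y = a[:, None, None] * x + b[:, None, None]   # per-channel affine transformation (equation (1))
    return np.rint(y / s_y).astype(np.int64)      # quantise to the output scale (equation (3))
```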
An example of a neural network accelerator 500 suitable for implementing the methods described herein is shown in
The input unit 502 is hardware configured to receive and store the input data to the neural network accelerator 500. The input data may be received from external memory. In some examples, the input unit 502 may comprise one or more buffers to store the received input data. Although the example NNA 500 of
Each processing unit 504, 506, 508, 510, 512, is itself an accelerator configured to accelerate performing one or more neural network operations on input data. Specifically, each processing unit 504, 506, 508, 510, 512 is configured to receive an input tensor and perform, via hardware logic, one or more operations on the input tensor to generate an output tensor. The NNA 500 of
The output unit 514 is hardware configured to receive the output tensor generated by processing the input data via one or more processing units 504, 506, 508, 510, 512. In some cases, the output unit 514 may have a buffer or other storage for temporarily storing the output tensor prior to outputting the output tensor from the NNA 500. In some cases, the output unit 514 may be configured to save the output tensor in external memory (i.e., memory that is external to the neural network accelerator).
The interconnection hardware 516 statically or dynamically connects the input unit, one or more processing units, and the output unit to allow input data to the neural network accelerator to flow through (e.g. be processed by) one or more processing units and then be output from the neural network accelerator. In some cases, the interconnection hardware 516 may comprise fixed hardware connections between the input unit 502, the processing units 504, 506, 508, 510, 512, and the output unit 514 that allow data to flow through the units in a limited number of ways. However, in other cases, the interconnection hardware 516 may comprise hardware that can dynamically connect the units 502-514 of the neural network accelerator in a plurality of different ways in response to one or more control signals. For example, the interconnection hardware 516 may comprise a crossbar and the units 502-514 may be connected to the crossbar in such a manner that the crossbar can dynamically connect the units in a plurality of different ways in response to one or more control signals. For example, in one hardware pass the crossbar may connect the output of the input unit to the depth-wise convolution processing unit 504, connect the output of the depth-wise convolution processing unit 504 to the high-precision re-quantisation processing unit, and then connect the output of the high-precision re-quantisation processing unit to the input of the output unit 514 so that the input data for the hardware pass is processed by the depth-wise convolution processing unit then the high-precision re-quantisation processing unit. In another hardware pass, the crossbar may connect the output of the input unit 502 to the convolution processing unit 508, and then the output of the convolution processing unit 508 to the input of the output unit 514 so that the input data for the hardware pass is processed by the convolution processing unit. Accordingly, in these cases the connections between the units 502-514 of the neural network accelerator (and thus the manner in which data may flow through the units of the NNA) are not fixed or static.
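Purely as an illustration of the two hardware passes just described, each pass can be pictured as an ordered list of units for the crossbar to connect; the data structure below is hypothetical and is not an interface of the NNA.

```python
from dataclasses import dataclass

@dataclass
class HardwarePass:
    # Ordered list of unit names the crossbar connects for this hardware pass.
    pipeline: list

# Pass used for the per-channel affine transformation described in this document.
affine_pass = HardwarePass(pipeline=[
    "input_unit",
    "depth_wise_convolution_processing_unit",
    "high_precision_requantisation_processing_unit",
    "output_unit",
])

# Pass in which the input data is processed by the convolution processing unit only.
convolution_pass = HardwarePass(pipeline=[
    "input_unit",
    "convolution_processing_unit",
    "output_unit",
])
```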
Although, not shown, the units 502-514 and the interconnection hardware 516 of the NNA may receive control information for each hardware pass indicating which units are to be active or used in the hardware pass and how each active unit and the interconnection unit are to be configured for that hardware pass. The control information may also indicate other information such as the formats of the input and output data of the units.
Reference is now made to
While 2D convolutions are the most common type of convolutions implemented in a neural network, other convolutions, such as depth-wise convolutions, are becoming more prominent in neural networks. As is known to those of skill in the art, in a depth-wise convolution there is a set of one or more filters per channel of the input tensor, and each channel of the input tensor is convolved with each filter in the corresponding set of filters to generate an output channel. In some cases, there is a single filter per channel, but in other cases there may be multiple filters per channel. The number of filters per channel may be referred to as the channel multiplier, T. Therefore the number of channels of the output tensor of a depth-wise convolution is T*C.
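For illustration only, the following NumPy sketch implements a depth-wise convolution with a channel multiplier T as just described, assuming a channels-first layout, unit strides and no padding; it is a reference description of the operation, not of the accelerator hardware.

```python
import numpy as np

def depthwise_conv(x, w, bias):
    """Depth-wise convolution (stride 1, no padding), channels-first layout.

    x:    input tensor of shape (C, H, W)
    w:    weights of shape (C, T, KH, KW) - T filters per input channel
    bias: biases of shape (C, T)
    Returns an output tensor of shape (C*T, H-KH+1, W-KW+1).
    """
    C, H, W = x.shape
    _, T, KH, KW = w.shape
    out = np.zeros((C * T, H - KH + 1, W - KW + 1))
    for c in range(C):              # each input channel is convolved independently
        for t in range(T):          # with each of its T filters
            for i in range(H - KH + 1):
                for j in range(W - KW + 1):
                    window = x[c, i:i + KH, j:j + KW]
                    out[c * T + t, i, j] = np.sum(window * w[c, t]) + bias[c, t]
    return out
```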
It can be seen that a depth-wise convolution is a simpler operation than a 2D convolution. Accordingly, it may be inefficient to use the complicated convolution processing unit to implement a depth-wise convolution. Thus a neural network accelerator may have specialised hardware, such as a depth-wise convolution processing unit, to accelerate a depth-wise convolution. Such a depth-wise convolution processing unit thus has hardware to receive, for each channel, input data values, weights, and biases; and, for each channel, to calculate a weighted sum of various combinations of inputs and weights and add the bias to the weighted sum. As described above, to simplify a depth-wise convolution processing unit, the depth-wise convolution processing unit may be configured to receive integer values (rather than floating point values) and perform integer operations thereon. Such a depth-wise convolution processing unit may be referred to as an integer depth-wise convolution processing unit.
Returning to
As shown in
In some cases, the depth-wise convolution engines 602 may be configured to receive weights and input integers with a first predetermined maximum bit width and receive biases with a second predetermined maximum bit width, wherein the second predetermined bit width is larger than the first predetermined bit width. For example, in some cases, each depth-wise convolution engine 602 may be configured to receive low-precision weights and input values (e.g. 8 bits) and high-precision biases (e.g. 32 bits). Having relatively low-precision weights and input values may simplify the multiplication logic used to implement the weighted-sum unit 604. Having a high-precision bias may allow a higher accuracy output.
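To make the arithmetic concrete, the following sketch models the integer computation a depth-wise convolution engine might perform for one channel of a 1×1 convolution (the case used later in this document), with a low-precision weight and a high-precision bias; the 64-bit accumulator and the clamp range are illustrative assumptions rather than the hardware's actual widths.

```python
import numpy as np

def depthwise_engine_1x1(x_q, weight, bias, clamp_min=-(2 ** 31), clamp_max=2 ** 31 - 1):
    """Model one channel of a 1x1 depth-wise convolution on integer inputs.

    x_q:    integer input values for the channel (e.g. 8-bit), any shape
    weight: single low-precision integer weight for the channel (e.g. 8-bit)
    bias:   high-precision integer bias for the channel (e.g. 32-bit)
    For a 1x1 kernel the weighted sum reduces to weight * x_q; the bias is then
    added and the result clamped to the range accepted by the next processing unit.
    """
    acc = np.int64(weight) * x_q.astype(np.int64) + np.int64(bias)
    return np.clip(acc, clamp_min, clamp_max)
```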
As shown in
In some cases, to support input tensels in an asymmetric quantised number format, which as described in more detail below means that each tensel is represented by an integer value, a common input scale and a common zero point offset, each depth-wise convolution engine 602 may comprise a subtraction unit 610 that is configured to receive the common input zero point offset and subtract the common zero point offset from the input integer values prior to calculating the weighted sum. Similarly, to support weights in an asymmetric quantised number format each depth-wise convolution engine 602 may comprise a second subtraction unit (not shown) that is configured to subtract a weight zero point offset from the weights prior to calculating the weighted sum. While there may be a common input zero point offset for the input tensor, in some cases, there may be a weight zero point per channel.
In some cases, the depth-wise convolution processing unit 504 may comprise, or have access to, one or more storage units for storing the input values, the weights, the biases, the zero point offsets and/or the output values. For example, as shown in
Having multiple depth-wise convolution engines 602 allows multiple channels of the input tensor to be processed at the same time. Depending on the number of channels in the input tensor, and the number of depth-wise convolution engines 602, a depth-wise convolution engine 602 may process more than one channel of the input tensor. For example, a first depth-wise convolution engine 602 may first receive and process a channel (e.g. channel 0) of the input tensor to generate one channel of the output tensor, and subsequently receive and process another channel (e.g. channel 5) of the input tensor to generate another channel of the output tensor. In some cases, the depth-wise convolution processing unit 504 may be configured to divide the channels of the input tensor equally amongst the depth-wise convolution engines 602. In some cases, the depth-wise convolution processing unit 504 may be configured to interleave the processing of multiple channels on a depth-wise convolution engine 602. For example, a depth-wise convolution engine 602 may be configured to generate part of the output for a first channel, generate part of the output for a second channel, and then go back to processing the first channel.
Also, although not shown in
Reference is now made to
The example re-quantisation processing unit 506 of
In some cases, each re-quantisation engine 902 may also comprise an addition unit 910 which is configured to add a zero point or offset to the output of the shift and round unit 906, and a clamp unit 912 which is configured to clamp the output of the addition unit 910 to a desired range. The clamp unit 912 may operate in the same manner as the clamp unit 608 of
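The following is a simplified software model of the re-quantisation behaviour described above (integer multiplication, shift with rounding, optional zero point addition, then clamping); the round-half-up rounding and the signed 8-bit output range are assumptions made purely for illustration.

```python
def requantise(value, multiplier, shift, zero_point=0, out_min=-128, out_max=127):
    """Scale an integer value by (multiplier / 2**shift), add a zero point, then clamp.

    value:      high-precision integer input (e.g. a 32-bit accumulator output)
    multiplier: high-precision integer multiplication input
    shift:      number of bits to shift the product right
    """
    product = value * multiplier
    # Shift right with rounding (round half up on the discarded fraction).
    rounded = (product + (1 << (shift - 1))) >> shift if shift > 0 else product
    rounded += zero_point
    return max(out_min, min(out_max, rounded))
```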
The re-quantisation processing unit 506 is able to perform the re-quantisation on a per-tensor basis (i.e. all of the input tensels in the tensor are re-quantised in the same manner—e.g. using the same re-quantisation factor, zero point offset)—or on a per-channel basis (i.e. each channel of the input tensor may be re-quantised in a different manner—e.g. using different re-quantisation factors and/or different zero point offsets). In some cases, the re-quantisation processing unit 506 may receive control information for a hardware pass indicating how the re-quantisation processing unit 506 is to be configured for that hardware pass—e.g. whether it is to perform per channel or per-tensor re-quantisation, whether a zero point offset is to be added etc.
In some cases, the re-quantisation processing unit 506 may comprise, or have access to, one or more storage units for storing the parameters for a re-quantisation operation (e.g. multiplication values for each channel, shift values for each channel etc.). For example, as shown in
As described above, the inventors have determined that a per channel quantised affine transformation can be performed efficiently and accurately on such a neural network accelerator by representing each affine transformation as an addition followed by a multiplication and performing the additions by performing, via the depth-wise convolution processing unit, a 1×1 depth-wise convolution on the input tensor using appropriate per-channel weights and biases to implement the affine transformations, and performing the multiplications by scaling, via the high-precision re-quantisation processing unit, the outputs of the depth-wise convolution processing unit, by appropriate per channel re-quantisation factors to implement the affine transformations.
As noted above, a quantised affine transformation can be written in terms of the quantised values as shown in equation (4). Separating the multiplications from the addition in equation (4) produces equation (5) (assuming a≠0). To allow the addition to be performed by an integer addition unit, such as the bias addition unit 606 of the depth-wise convolution processing unit 504 of
The inventors have determined that equation (6) can be efficiently and accurately implemented on a neural network accelerator, such as the neural network accelerator of
Rounding the ratio of (i) the addition parameter b for a channel and (ii) the product of the multiplication parameter a for that channel and the input scale sx
To increase the accuracy of the output integer value yq,
As shown in
There may be a per tensor addition scale factor m or a per channel addition scale factor m. The addition scale factor(s) may be selected in any suitable manner. For example, in some cases, a per tensor addition scale factor m may be selected to be the largest weight value accepted by the depth-wise convolution processing unit 504. For example, if the depth-wise convolution processing unit 504 accepts 8-bit unsigned weights such that the largest weight value accepted by the depth-wise convolution processing unit 504 is 255, then the per tensor addition scale factor m may be set to 255. Similarly, if the depth-wise convolution processing unit accepts 8-bit signed weights such that the largest weight value accepted by the depth-wise convolution processing unit is 127, then the per tensor addition scale factor m may be set to 127.
In other cases, a per tensor addition scale factor m may be selected to be the largest power of 2 weight accepted by the depth-wise convolution processing unit. For example, if the depth-wise convolution processing unit 504 accepts 8-bit unsigned weights such that the largest power of 2 weight accepted by the depth-wise convolution processing unit is 2^7=128, then the per tensor addition scale factor m may be set to 2^7=128. Where the high-precision scaling is implemented by the re-quantisation processing unit as an integer multiplication and a shift, this allows the inverse addition scale factor (1/m) to be reflected in the shift only.
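As an illustration of why a power-of-two addition scale factor is convenient, the sketch below decomposes a positive scale into a generic (integer multiplier, shift) pair; the decomposition scheme and bit width are assumptions and not the NNA's exact re-quantisation format. Dividing the scale by m = 2^7 leaves the integer multiplier unchanged and simply adds 7 to the shift.

```python
def to_multiplier_and_shift(scale, multiplier_bits=15):
    """Approximate a positive scale as (multiplier, shift) with scale ~= multiplier / 2**shift.

    The multiplier is kept within roughly multiplier_bits bits; this is a generic
    decomposition for illustration only.
    """
    shift = 0
    while scale < (1 << (multiplier_bits - 1)) and shift < 63:
        scale *= 2
        shift += 1
    return int(round(scale)), shift

# Example: dividing the re-quantisation factor by m = 2**7 only changes the shift.
mult_a, shift_a = to_multiplier_and_shift(0.35)
mult_b, shift_b = to_multiplier_and_shift(0.35 / 2 ** 7)
assert mult_a == mult_b and shift_b == shift_a + 7
```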
In other cases, a per channel addition scale factor m may be selected. For example, a per channel addition scale factor m may be selected to be the m that minimizes the error shown in equation (8).
Since each channel has separate multiplication and addition parameters a and b, this may allow m to be selected for a channel such that
As discussed in more detail below, testing has shown that setting the addition scale factor m to any integer that is reasonably large (e.g. 16 or even 8) works well (i.e., achieves an accuracy equivalent to a floating point implementation of the affine transformation). Selecting a per channel addition scale factor m in accordance with equation (8) may, in some cases, improve the accuracy even more, particularly in more challenging neural networks, e.g. neural networks with low bit depths (e.g. 4 bits) or neural networks that are very sensitive to errors.
In some cases, the addition scale factor m (either per tensor or per channel) may be selected, or adjusted, so that the output of the bias addition unit 606 does not have to be clamped by the corresponding clamp unit 608. As described above, the clamp unit 608 is configured to clamp the output of the bias addition unit 606 to a desired integer range where the desired range may be based on, for example, which processing unit is to receive the output. So where the output is to be sent to the re-quantisation processing unit 506 then the clamp unit 608 may be configured to clamp the output to the integer range accepted by the re-quantisation processing unit 506. Accordingly, if the re-quantisation processing unit 506 accepts 32-bit inputs then ideally the output of the bias addition unit 606 is an integer within a 32-bit integer range so that it is output without clamping by the clamp unit 608. Therefore, the addition scale factor m may be selected so that the bias
A quantised affine transformation for a channel can be implemented as set out in equation (6) or equation (7) only if the multiplication parameter for that channel is not zero (i.e., a≠0). This is because when the multiplication parameter for a channel is zero (i.e., a=0) the implementations set out in equations (6) and (7) require a division by zero.
When the multiplication parameter for a channel is zero (i.e., a=0), then the affine transformation becomes y=ax+b=b. Therefore the output integer can be represented as shown in equation (9).
This can be implemented using the depth-wise convolution processing unit 504 and the re-quantisation processing unit 506 by setting the weight for the 1×1 convolution to be zero for that channel, setting the bias for that channel to be an integer that represents the ratio of the addition parameter b for that channel and the common output scale sy, and setting the re-quantisation factor for that channel to one, so that the output integer is equal to the bias
for each input tensel of that channel. This allows the per-channel quantised affine transformation to be performed for the whole input tensor by a 1×1 depth-wise convolution performed by the depth-wise convolution processing unit and per channel scaling performed by the high-precision re-quantisation processing unit. The weight, bias and the re-quantisation factor are just selected differently for a channel wherein the multiplication parameter is zero. In some cases, the weight, bias and re-quantisation factor may be selected in this manner, not only for channels wherein the multiplication parameter is zero, but also for channels wherein the multiplication parameter is negligibly small, or substantially zero, such that the input would not affect the quantised output.
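A minimal sketch, under the choices just described, of how the weight, bias and re-quantisation factor might be selected for a channel whose multiplication parameter is zero or substantially zero; the helper name and the use of round-to-nearest are illustrative assumptions.

```python
import numpy as np

def zero_multiplication_channel_params(b, s_y):
    """Parameters for a channel whose multiplication parameter a is zero (or negligible).

    The 1x1 depth-wise convolution weight is 0, the bias is the integer nearest to
    b / s_y, and the re-quantisation factor is 1, so every output tensel of the
    channel equals the bias.
    """
    weight = 0
    bias = int(np.rint(b / s_y))
    requant_factor = 1.0
    return weight, bias, requant_factor
```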
In some cases, the inputs and outputs of a per channel affine transformation may be in an asymmetric quantised number format (which may also be referred to as an affine quantised number format). An asymmetric quantised number format is a number format in which the range of values represented by the number format is not symmetric about zero, i.e., the range of values is asymmetric about zero. This may be implemented by using a zero point offset in addition to an integer value and a scale. For example, an input number x in an asymmetric quantised number format may be expressed using equation (10) wherein xq is an integer value, sx is the input scale factor, and zx is the input zero point offset. The zero point offset is often, but does not have to be, an integer value. An example asymmetric quantised number format is the Q8A number format.
x=(xq−zx)*sx (10)
Where the inputs of a per channel affine transformation are in an asymmetric quantised number format (e.g. as shown in equation (10)) the input zero point offset zx (the zero point offset associated with the input tensor) is to be taken into account when performing the affine transformation. As described above, some depth-wise convolution processing units, such as the depth-wise convolution processing unit of
Where, however, the depth-wise convolution processing unit 504 does not comprise hardware to remove or subtract a zero point offset from the input integers, the input zero point offset may be incorporated into the per-channel quantised affine transformation. To incorporate the input zero point offset into the per-channel quantised affine transformation, xq is replaced with (xq−zx) in the affine transformation equations expressed above. This results in equation (4) becoming equation (11), equation (6) becoming equation (12), and equation (7) becoming equation (13).
It can be seen that this effectively replaces b with b−a*zx*sx in equations (4), (6) and (7). Accordingly, where the input zero point offset is incorporated into the affine transformation, the bias for each channel is set to
As described above, where a channel has a multiplication parameter that is equal, or substantially equal, to zero, the weight for that channel is set to zero such that the input integer is multiplied by zero, so an input zero point offset has no effect on the output of a channel that has a multiplication parameter that is equal, or substantially equal, to zero. Therefore no changes are made for a channel that has a multiplication parameter that is equal, or substantially equal, to zero to take into account an input zero point offset.
Where the output of a per channel affine transformation is to be in an asymmetric quantised number format as shown in equation (14), where yq is an integer value, sy is the output scale factor, and zy is the output zero point offset, the output zero point offset is to be taken into account when performing the affine transformation. As described above, some high-precision re-quantisation processing units, such as the high-precision re-quantisation processing unit of
y=(yq−zy)*sy (14)
Where, however, the high-precision re-quantisation processing unit 506 does not comprise hardware to add a zero point offset to the scaled value, the output zero point offset may be incorporated into the per-channel quantised affine transformation. To incorporate the output zero point offset into the per-channel quantised affine transformation, yq is replaced with (yq−zy) in the affine transformation equations expressed above. Specifically, equation (4) becomes equation (15), equation (6) becomes equation (16), and equation (7) becomes equation (17).
It can be seen that this effectively replaces b with b+zy*sy in equations (4), (6) and (7). Accordingly, where the output zero point offset is incorporated into the per channel quantised affine transformation, the bias for each channel is set to
While an input zero point offset does not affect the output for a channel with a multiplication parameter that is equal, or substantially equal, to zero, an output zero point offset does affect the output of such a channel. Since the output is controlled solely by the bias for that channel, the output zero point offset can be incorporated into the affine transformation for a channel that has a multiplication parameter that is equal, or substantially equal, to zero by using a bias of
In some cases, both the input zero point offset and the output zero point offset may be incorporated into the per channel quantised affine transformation. Equations (4), (6) and (7) then become equations (18), (19) and (20) as shown below.
It can be seen that this effectively replaces b with b+zy*sy−a*sx*zx in equations (4), (6) and (7). Accordingly, where both the input zero point offset and the output zero point offset are incorporated into the per channel quantised affine transformation, the bias for each channel is set to
Since an input zero point offset does not affect the output, this implementation (taking into account both an input zero point offset and an output zero point offset) would be the same for a channel with a zero, or substantially zero, multiplication parameter as the case where only the output zero point offset is taken into account. Specifically, an output zero point offset can be incorporated into the affine transformation for such a channel by using a bias of
Equations (6), (7), (12), (13), (16), (17), (19) and (20), which represent examples of the additions and multiplications performed for a channel with a non-zero multiplication parameter, can be generally represented as shown in equation (21) such that for each such channel the weight of the 1×1 depth-wise convolution is set to the addition scale factor m for the channel, the bias for the 1×1 depth-wise convolution is set to an integer that represents the ratio of (i) the product of the addition scale factor m for the channel and an addition variable R for the channel (which is based on the addition parameter b for the channel) and (ii) the product of the multiplication parameter a for the channel and the input scale factor sx; and the channel re-quantisation factor for the scaling is set to the ratio of (i) the product of the multiplication parameter a for the channel and the input scale factor sx, and (ii) the product of the output scale factor sy and the addition scale factor m for the channel.
The addition scale factor m and the addition variable R may vary between implementations. For example, in some cases, as expressed by equations (6), (12), (16) and (19), the addition scale factor m is not used to scale up the bias value prior to rounding, such that m is set to 1. In other cases, the addition scale factor m may be an integer greater than one to scale up the bias value prior to rounding. Also, depending on whether the input zero point offset and/or the output zero point offset are to be incorporated into the affine transformations, the addition variable R may be equal to: the addition parameter b (equations (6), (7)); b−a*sx*zx (equations (12), (13)); b+zy*sy (equations (16), (17)); or b+zy*sy−a*sx*zx (equations (19), (20)).
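Drawing the above together, the per-channel weight, integer bias B and high-precision re-quantisation factor A described around equation (21) could be computed offline along the following lines; the function signature, the use of round-to-nearest, and the threshold used to treat a multiplication parameter as substantially zero are illustrative assumptions rather than a definitive implementation.

```python
import numpy as np

def affine_to_nna_params(a, b, s_x, s_y, m, z_x=0, z_y=0, eps=0.0):
    """Per-channel parameters for implementing y = a*x + b as a 1x1 depth-wise
    convolution (weight, integer bias B) followed by a high-precision scaling by A.

    a, b:     per-channel affine parameters, arrays of shape (C,)
    s_x, s_y: common input and output scales
    m:        per-channel (or broadcast per-tensor) addition scale factor
    z_x, z_y: common input and output zero points (leave at 0 for symmetric formats,
              or where the hardware removes/adds the zero points itself)
    eps:      threshold below which |a| is treated as (substantially) zero
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    m = np.broadcast_to(np.asarray(m, dtype=np.float64), a.shape)

    # Addition variable R, covering the four cases discussed above (z_x and/or z_y may be 0).
    R = b + z_y * s_y - a * s_x * z_x

    nonzero = np.abs(a) > eps
    weight = np.where(nonzero, m, 0.0)
    B = np.where(
        nonzero,
        np.rint(np.divide(m * R, a * s_x, out=np.zeros_like(a), where=nonzero)),
        # For a = 0 channels the output equals the bias, so any output zero point
        # folded into the bias is added here.
        np.rint(b / s_y + z_y),
    )
    A = np.where(
        nonzero,
        np.divide(a * s_x, s_y * m, out=np.ones_like(a), where=nonzero),
        1.0,
    )
    return weight.astype(np.int64), B.astype(np.int64), A
```

At run time, the depth-wise convolution processing unit would then compute weight*xq + B for each tensel of a channel, and the high-precision re-quantisation processing unit would scale the result by A (with the output zero point added in hardware where that option is used), yielding the output integer yq.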
Accordingly, a per channel quantised affine transformation can be performed on an input tensor using a neural network accelerator, such as the neural network accelerator of
Reference is now made to
Specifically, at block 1202 the additions are implemented by performing, using the depth-wise convolution processing unit 504, a 1×1 depth-wise convolution on the input tensor with suitable per channel weights and biases to implement the additions. In some cases, the weight for a channel is set to the addition scale factor m for the channel and the bias for the channel is set to an integer addition value B for the channel.
As described above with respect to equation (21), the integer addition value B for a channel that has a non-zero multiplication parameter (i.e., a≠0) may be based on the addition parameter b for the channel, the addition scale factor m for the channel, the multiplication parameter a for the channel, and the common input scale sx.
For example, as described above with respect to equation (21), the integer addition value B for a channel that has a non-zero multiplication parameter (i.e., a≠0) may be an integer that represents a ratio of (i) a product of the addition scale factor m for the channel and an addition variable R for the channel, and (ii) a product of the multiplication parameter a for the channel and the common input scale sx, wherein the addition variable R for a channel is based on the addition parameter b for the channel.
In some cases, the addition variable R is equal to the addition parameter b for that channel (e.g. R=b—see equation (6) and equation (7)). In other cases, the addition variable R may be a combination of the addition parameter b for that channel and one or more other values. For example, where the input to the per-channel affine transformation is in an asymmetric quantised number format such that each input tensel is represented by an integer value, a common input scale and a common input zero point, to account for the common input zero point the addition variable R for a channel may be a difference between the addition parameter b for the channel and the product of the multiplication parameter for that channel, the common input scale, and the common input zero point (see, for example, equations (12) and (13)). In another example, where the output of the per-channel quantised affine transformation is to be in an asymmetric quantised number format such that each output tensel is represented by an integer value, a common output scale, and a common output zero point, to account for the common output zero point the addition variable R for a channel may be the combination of the addition parameter b for the channel and the product of the common output scale sy and the common output zero point zy (see, for example, equations (16) and (17)). In yet another example, where the input to the per-channel quantised affine transformation is in an asymmetric quantised number format and the output of the per-channel quantised affine transformation is to be in an asymmetric quantised number format, to account for the common input zero point and the common output zero point, the addition variable R for a channel may be the combination of the addition parameter b for the channel, the product of the common output scale sy and the common output zero point zy, and the product of the multiplication parameter a for that channel, the common input scale sx, and the common input zero point zx (see, for example, equations (19) and (20)).
The addition scale factor m for a channel that has a non-zero multiplication parameter may be an integer greater than or equal to one. In some cases, the addition scale factor m for a channel is equal to one (see, for example, equation (6)). In other cases, the addition scale factor m for a channel may be greater than one to scale the bias so as to improve the accuracy of the rounding of the bias value (see, for example, equation (7)). In some cases, each channel has the same addition scale factor m. In other cases, at least two of the channels have different addition scale factors m.
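To illustrate why an addition scale factor m greater than one can improve the accuracy of the rounded bias, consider the following worked example, in which the value 2.3 for R/(a·sx) is illustrative only. Because the factor 1/m is folded into the multiplication value A (see block 1204), the emulated bias contribution is B/m, so a larger m gives a finer quantisation step for the bias.

    ideal = 2.3                    # illustrative value of R / (a * s_x)
    for m in (1, 2, 8, 64):
        B = round(m * ideal)       # integer bias stored in the depth-wise convolution
        emulated = B / m           # bias contribution after scaling by A
        print(m, B, emulated, abs(emulated - ideal))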
As described above, in some cases, for a channel that has a zero, or substantially zero, multiplication parameter (i.e., a=0), the addition scale factor m for that channel may be set to zero, and the addition value B for that channel may be set to an integer that represents the ratio of the addition parameter b for that channel and the common output scale sy.
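A sketch of this special case is set out below; round-to-nearest is assumed, and the grouping of the three per-channel values into one helper function is purely illustrative (the multiplication value of one is discussed further in relation to block 1204).

    def zero_a_channel_params(b, s_y):
        m = 0                      # weight of zero so the input does not contribute
        B = int(round(b / s_y))    # the constant b expressed in the output scale
        A = 1.0                    # multiplication value of one (see block 1204)
        return m, B, A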
In some cases, the per channel addition values B may be computed ‘offline’ (i.e. by a component external to the neural network accelerator) and provided to the neural network accelerator as inputs. The method 1200 then proceeds to block 1204.
At block 1204 each value output from the depth-wise convolution processing unit is scaled, using the high-precision re-quantisation processing unit, by a high-precision multiplication value A for the corresponding channel. The high-precision multiplication value A for a channel is based on the multiplication parameter a for the channel, the addition scale factor m for the channel, the common input scale sx and the common output scale sy.
As described above, in some cases, the multiplication value A for a channel that has a non-zero multiplication parameter a is equal to the ratio of (i) the product of the multiplication parameter a for that channel and the common input scale sx, and (ii) the product of the common output scale sy and the addition scale factor m for that channel (see, for example, equations (6) and (7)). In some cases, the multiplication value A for a channel that has a zero multiplication parameter (or a substantially zero multiplication parameter) may be set to one.
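Putting blocks 1202 and 1204 together, the following sketch emulates the per-channel quantised affine transformation y = a·x + b for symmetric input and output quantisation (so R = b) and compares it against a floating point reference. It is illustrative only: the high-precision re-quantisation step is modelled here as a floating point multiplication followed by round-to-nearest, and the scales, parameter ranges and addition scale factor are arbitrary example values rather than values taken from any particular implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    C, H, W = 4, 3, 3
    s_x, s_y = 0.05, 0.1                      # common input and output scales
    a = rng.uniform(0.5, 2.0, C)              # per-channel multiplication parameters
    b = rng.uniform(-1.0, 1.0, C)             # per-channel addition parameters
    m = np.full(C, 8)                         # per-channel addition scale factors

    x_q = rng.integers(-128, 128, (C, H, W))  # quantised input tensor

    R = b                                     # symmetric case (equations (6) and (7))
    B = np.round(m * R / (a * s_x)).astype(np.int64)
    A = (a * s_x) / (s_y * m)                 # high-precision multiplication values

    conv = m[:, None, None] * x_q + B[:, None, None]          # block 1202
    y_q = np.round(A[:, None, None] * conv).astype(np.int64)  # block 1204

    reference = np.round((a[:, None, None] * (s_x * x_q) + b[:, None, None]) / s_y)
    print(np.max(np.abs(y_q - reference)))    # typically 0, occasionally 1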
The method 1200 may then end.
Although
Table 1 shows the Top-1 and Top-5 accuracy of the ResNet V2 50 neural network, using the implementation and the weights from the TIMM library, when (i) the neural network was implemented in floating point (labelled “Floating point” in the Table); (ii) the inputs and weights of each layer were quantised but the stand-alone batch normalisation (BN) layers were implemented in floating point (e.g. using floating point multiplication and addition parameters, with rounding only at the end) (labelled “Quantised (BNs in floating point)” in the Table); and (iii) the inputs and weights of each layer were quantised and the stand-alone batch normalisation layers were implemented as described herein with a variety of different addition scale factors m (labelled “Quantised (BNs emulated)” in the Table). As is known to those of skill in the art, a Top-n accuracy measures how often the top n predicted classifications of a classification neural network include the correct classification.
As is known to those of skill in the art, the Top-n accuracy measurement is not perfect, in that it is an estimate of the probability of a correct classification based on a limited number of samples from the data distribution. This effectively makes the accuracy measurement a noisy estimate, so accuracies within 0.1% of each other are deemed to be essentially equivalent. It can be seen from Table 1 that when the stand-alone batch normalisation layers of a neural network are implemented in the manner described above (i.e., using the depth-wise convolution processing unit and the high-precision re-quantisation processing unit), the accuracy of the neural network is, at a minimum, similar to when the stand-alone batch normalisation layers of the neural network are implemented with quantised inputs but floating point parameters, which is what the described methods are designed to emulate. Furthermore, the accuracy can more closely match that achieved by implementing the batch normalisation layers with quantised inputs but floating point parameters through the use of an addition scale factor m of 8 or higher.
Table 2 illustrates the Top-1 and Top-5 accuracy for the same neural network and the same scenarios as Table 1, except that, instead of only the stand-alone batch normalisation layers being implemented as described herein, all batch normalisation layers (e.g. even those batch normalisation layers that would, in many cases, be fused with a preceding convolution layer) are implemented as described herein (i.e., using the depth-wise convolution processing unit and the high-precision re-quantisation processing unit).
It can be seen from Table 2 that the results are similar even when the number of batch normalisation layers implemented in this manner is increased. Therefore, increasing the number of batch normalisation layers that are implemented in the manner described herein does not affect the accuracy of the neural network.
The neural network accelerators, depth-wise convolution processing units, and re-quantisation processing units of
The neural network accelerators, depth-wise convolution processing units, and re-quantisation processing units described herein may be embodied in hardware on an integrated circuit. The neural network accelerators described herein may be configured to perform any of the methods described herein. Generally, any of the functions, methods, techniques or components described above can be implemented in software, firmware, hardware (e.g., fixed logic circuitry), or any combination thereof. The terms “module,” “functionality,” “component”, “element”, “unit”, “block” and “logic” may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs the specified tasks when executed on a processor. The algorithms and methods described herein could be performed by one or more processors executing code that causes the processor(s) to perform the algorithms/methods. The code may be stored on a computer-readable storage medium. Examples of a computer-readable storage medium include a random-access memory (RAM), read-only memory (ROM), an optical disc, flash memory, hard disk memory, and other memory devices that may use magnetic, optical, and other techniques to store instructions or other data and that can be accessed by a machine.
The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for processors, including code expressed in a machine language, an interpreted language or a scripting language. Executable code includes binary code, machine code, bytecode, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in a programming language such as C, Java or OpenCL. Executable code may be, for example, any kind of software, firmware, script, module or library which, when suitably executed, processed, interpreted, compiled, or run at a virtual machine or other software environment, causes a processor of the computer system at which the executable code is supported to perform the tasks specified by the code.
A processor, computer, or computer system may be any kind of device, machine or dedicated circuit, or collection or portion thereof, with processing capability such that it can execute instructions. A processor may be or comprise any kind of general purpose or dedicated processor, such as a CPU, GPU, NNA, System-on-chip, state machine, media processor, an application-specific integrated circuit (ASIC), a programmable logic array, a field-programmable gate array (FPGA), or the like. A computer or computer system may comprise one or more processors.
It is also intended to encompass software which defines a configuration of hardware as described herein, such as HDL (hardware description language) software, as is used for designing integrated circuits, or for configuring programmable chips, to carry out desired functions. That is, there may be provided a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition dataset that when processed (i.e., run) in an integrated circuit manufacturing system configures the system to manufacture a neural network accelerator configured to perform any of the methods described herein, or to manufacture a neural network accelerator comprising any apparatus described herein. An integrated circuit definition dataset may be, for example, an integrated circuit description.
Therefore, there may be provided a method of manufacturing, at an integrated circuit manufacturing system, a neural network accelerator as described herein. Furthermore, there may be provided an integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, causes the method of manufacturing a neural network accelerator to be performed.
An integrated circuit definition dataset may be in the form of computer code, for example as a netlist, code for configuring a programmable chip, as a hardware description language defining hardware suitable for manufacture in an integrated circuit at any level, including as register transfer level (RTL) code, as high-level circuit representations such as Verilog or VHDL, and as low-level circuit representations such as OASIS (RTM) and GDSII. Higher level representations which logically define hardware suitable for manufacture in an integrated circuit (such as RTL) may be processed at a computer system configured for generating a manufacturing definition of an integrated circuit in the context of a software environment comprising definitions of circuit elements and rules for combining those elements in order to generate the manufacturing definition of an integrated circuit so defined by the representation. As is typically the case with software executing at a computer system so as to define a machine, one or more intermediate user steps (e.g., providing commands, variables etc.) may be required in order for a computer system configured for generating a manufacturing definition of an integrated circuit to execute code defining an integrated circuit so as to generate the manufacturing definition of that integrated circuit.
An example of processing an integrated circuit definition dataset at an integrated circuit manufacturing system so as to configure the system to manufacture a neural network accelerator will now be described with respect to
The layout processing system 1404 is configured to receive and process the IC definition dataset to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art, and for example may involve synthesising RTL code to determine a gate level representation of a circuit to be generated, e.g., in terms of logical components (e.g., NAND, NOR, AND, OR, MUX and FLIP-FLOP components). A circuit layout can be determined from the gate level representation of the circuit by determining positional information for the logical components. This may be done automatically or with user involvement in order to optimise the circuit layout. When the layout processing system 1404 has determined the circuit layout it may output a circuit layout definition to the IC generation system 1406. A circuit layout definition may be, for example, a circuit layout description.
The IC generation system 1406 generates an IC according to the circuit layout definition, as is known in the art. For example, the IC generation system 1406 may implement a semiconductor device fabrication process to generate the IC, which may involve a multiple-step sequence of photolithographic and chemical processing steps during which electronic circuits are gradually created on a wafer made of semiconducting material. The circuit layout definition may be in the form of a mask which can be used in a lithographic process for generating an IC according to the circuit definition. Alternatively, the circuit layout definition provided to the IC generation system 1406 may be in the form of computer-readable code which the IC generation system 1406 can use to form a suitable mask for use in generating an IC.
The different processes performed by the IC manufacturing system 1402 may be implemented all in one location, e.g., by one party. Alternatively, the IC manufacturing system 1402 may be a distributed system such that some of the processes may be performed at different locations, and may be performed by different parties. For example, some of the stages of: (i) synthesising RTL code representing the IC definition dataset to form a gate level representation of a circuit to be generated, (ii) generating a circuit layout based on the gate level representation, (iii) forming a mask in accordance with the circuit layout, and (iv) fabricating an integrated circuit using the mask, may be performed in different locations and/or by different parties.
In other examples, processing of the integrated circuit definition dataset at an integrated circuit manufacturing system may configure the system to manufacture a neural network accelerator without the IC definition dataset being processed so as to determine a circuit layout. For instance, an integrated circuit definition dataset may define the configuration of a reconfigurable processor, such as an FPGA, and the processing of that dataset may configure an IC manufacturing system to generate a reconfigurable processor having that defined configuration (e.g., by loading configuration data to the FPGA).
In some embodiments, an integrated circuit manufacturing definition dataset, when processed in an integrated circuit manufacturing system, may cause an integrated circuit manufacturing system to generate a device as described herein. For example, the configuration of an integrated circuit manufacturing system in the manner described above with respect to
In some examples, an integrated circuit definition dataset could include software which runs on hardware defined at the dataset or in combination with hardware defined at the dataset. In the example shown in
The implementation of concepts set forth in this application in devices, apparatus, modules, and/or systems (as well as in methods implemented herein) may give rise to performance improvements when compared with known implementations. The performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During manufacture of such devices, apparatus, modules, and systems (e.g., in integrated circuits) performance improvements can be traded off against the physical implementation, thereby improving the method of manufacture. For example, a performance improvement may be traded against layout area, thereby matching the performance of a known implementation but using less silicon. This may be done, for example, by reusing functional blocks in a serialised fashion or sharing functional blocks between elements of the devices, apparatus, modules and/or systems. Conversely, concepts set forth in this application that give rise to improvements in the physical implementation of the devices, apparatus, modules, and systems (such as reduced silicon area) may be traded for improved performance. This may be done, for example, by manufacturing multiple instances of a module within a predefined area budget.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.