A Deep Neural Network (DNN) is a form of artificial neural network comprising a plurality of interconnected layers that can be used for machine learning applications. In particular, a DNN can be used in signal processing applications, including, but not limited to, image processing and computer vision applications.
Reference is made to
The processing that is performed on the input data to a layer depends on the type of layer. For example, each layer of a DNN may be one of a plurality of different types. Example DNN layer types include, but are not limited to, a convolution layer, an activation layer, a normalisation layer, a pooling layer, and a fully connected layer. It will be evident to a person of skill in the art that these are example DNN layer types and that this is not an exhaustive list and there may be other DNN layer types.
For a convolution layer, the input data is processed by convolving the input data with weights associated with that layer. Specifically, each convolution layer is associated with a plurality of weights w0 . . . wg, which may also be referred to as filter weights or coefficients. The weights are grouped to form, or define, one or more filters, which may also be referred to as kernels, and each filter may be associated with an offset bias bias. As shown in
An activation layer, which typically, but not necessarily, follows a convolution layer, performs one or more activation functions on the input data to the layer. An activation function takes a single number and performs a certain non-linear mathematical operation on it. In some examples, an activation layer may act as a rectified linear unit (ReLU) by implementing a ReLU function (i.e. ƒ(x)=max(0, x)) or a Parametric Rectified Linear Unit (PReLU) by implementing a PReLU function.
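The two activation functions mentioned above can be sketched as follows. The ReLU follows the definition given above; the PReLU formula (a learned slope a applied to negative inputs) is an assumption of the sketch, since the text does not spell it out and PReLU parameterisations vary.

```python
def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x)."""
    return max(0.0, x)

def prelu(x, a):
    """Parametric ReLU (assumed form): f(x) = x for x > 0,
    otherwise a*x, where a is a learned slope parameter."""
    return x if x > 0 else a * x
```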
A normalisation layer is configured to perform a normalising function, such as a Local Response Normalisation (LRN) function, on the input data. A pooling layer, which is typically, but not necessarily, inserted between successive convolution layers, performs a pooling function, such as a max or mean function, to summarise subsets of the input data. The purpose of a pooling layer is thus to reduce the spatial size of the representation to reduce the number of parameters and computation in the network, and hence to also control overfitting.
A fully connected layer, which typically, but not necessarily, follows a plurality of convolution and pooling layers, takes a three-dimensional set of input data values and outputs an N-dimensional vector. Where the DNN is used for classification, N may be the number of classes and each value in the vector may represent the probability of a certain class. The N-dimensional vector is generated through a matrix multiplication of the input data values with a set of weights, optionally followed by a bias offset. A fully connected layer thus receives a set of weights and a bias.
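As an illustration only, the fully connected computation just described may be sketched as follows; the function name and the (H, W, C) data layout are hypothetical, and a real implementation would match the layout used by the hardware.

```python
import numpy as np

def fully_connected(input_3d, weights, bias):
    """Flatten a 3-D input volume and apply y = W.x + b.

    input_3d: array of shape (H, W, C)
    weights:  array of shape (N, H*W*C), N output classes
    bias:     array of shape (N,)
    Returns an N-dimensional output vector.
    """
    x = input_3d.reshape(-1)   # flatten to a 1-D vector of H*W*C values
    return weights @ x + bias  # matrix multiplication plus bias offset
```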
Accordingly, as shown in
Hardware (e.g. a DNN accelerator) for implementing a DNN comprises hardware logic that can be configured to process input data to the DNN in accordance with the layers of the DNN. Specifically, hardware for implementing a DNN comprises hardware logic that can be configured to process the input data to each layer in accordance with that layer and generate output data for that layer which either becomes the input data to another layer or becomes the output of the DNN. For example, if a DNN comprises a convolution layer followed by an activation layer, hardware logic that can be configured to implement that DNN comprises hardware logic that can be configured to perform a convolution on the input data to the DNN using the weights and biases associated with that convolution layer to produce output data for the convolution layer, and hardware logic that can be configured to apply an activation function to the input data to the activation layer (i.e. the output data of the convolution layer) to generate output data for the DNN.
As is known to those of skill in the art, for hardware to process a set of values each value is represented in a number format. Two common types of number formats are fixed point number formats and floating point number formats. As is known to those skilled in the art, a fixed point number format has a fixed number of digits after the radix point (e.g. decimal point or binary point). In contrast, a floating point number format does not have a fixed radix point (i.e. it can “float”). In other words, the radix point can be placed in multiple places within the representation. While representing the network parameters of a DNN in a floating point number format may allow more accurate or precise output data to be produced, processing network parameters in a floating point number format in hardware is complex which tends to increase the silicon area, power consumption, memory and bandwidth consumption, and complexity of the hardware compared to hardware that processes network parameters in other formats, such as, but not limited to, fixed point number formats. Accordingly, hardware for implementing a DNN may be configured to represent the network parameters of a DNN in another format, such as a fixed point number format, to reduce the area, power consumption, memory and bandwidth consumption, and complexity of the hardware logic.
Generally, the fewer bits that are used to represent the network parameters of a DNN (e.g. input data values, weights, biases, and output data values), the more efficiently the DNN can be implemented in hardware. However, typically the fewer bits that are used to represent the network parameters of a DNN, the less accurate the DNN becomes. Accordingly, it is desirable to identify number formats for representing the network parameters of the DNN that balance the number of bits used to represent the network parameters and the accuracy of the DNN.
The embodiments described below are provided by way of example only and are not limiting of implementations which solve any or all of the disadvantages of methods and systems for identifying number formats for representing the network parameters of a DNN.
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Described herein are methods of determining a number format for representing a set of two or more network parameters of a Deep Neural Network “DNN” for use in configuring hardware logic to implement the DNN. The method includes: determining a sensitivity of the DNN with respect to each network parameter in the set of network parameters; for each candidate number format of a plurality of candidate number formats: determining a quantisation error associated with quantising each network parameter in the set of network parameters in accordance with the candidate number format; generating an estimate of an error in an output of the DNN caused by quantisation of the set of network parameters based on the sensitivities and the quantisation errors; generating a local error based on the estimated error; and selecting the candidate number format of the plurality of candidate number formats with the minimum local error as the number format for the set of network parameters.
A first aspect provides a computer-implemented method of determining a number format for representing a set of two or more network parameters of a Deep Neural Network “DNN” for use in configuring hardware logic to implement the DNN, the method comprising: determining a sensitivity of the DNN with respect to each network parameter in the set of network parameters; for each candidate number format of a plurality of candidate number formats: determining a quantisation error associated with quantising each network parameter in the set of network parameters in accordance with the candidate number format; generating an estimate of an error in an output of the DNN caused by quantisation of the set of network parameters based on the sensitivities and the quantisation errors; and generating a local error based on the estimated error; and selecting the candidate number format of the plurality of candidate number formats with the minimum local error as the number format for the set of network parameters.
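The selection procedure of the first aspect can be sketched as follows. This is a non-authoritative sketch: the helper names are hypothetical, and it uses the absolute sensitivity-weighted sum of the quantisation errors (one of the estimates described below) as the estimated output error.

```python
def select_number_format(params, sensitivities, candidate_formats,
                         quantise, local_error):
    """Pick the candidate format with the minimum local error.

    params:            the set of network parameter values
    sensitivities:     one sensitivity per parameter (determined once)
    candidate_formats: iterable of candidate number formats
    quantise:          quantise(value, fmt) -> quantised value
    local_error:       combines the estimated output error (and,
                       optionally, a size term) into a scalar
    """
    best_fmt, best_err = None, float("inf")
    for fmt in candidate_formats:
        # quantisation error for each parameter under this format
        q_errs = [quantise(p, fmt) - p for p in params]
        # estimated error in the DNN output: absolute value of the
        # sensitivity-weighted sum of the quantisation errors
        est = abs(sum(s * e for s, e in zip(sensitivities, q_errs)))
        err = local_error(est, fmt)
        if err < best_err:
            best_fmt, best_err = fmt, err
    return best_fmt
```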
Determining the sensitivity of the DNN with respect to a network parameter may comprise: determining an output of a model of the DNN in response to test data; determining a partial derivative of one or more values based on the output of the DNN with respect to the network parameter; and determining the sensitivity from the one or more partial derivatives.
The one or more partial derivatives may be determined by a back-propagation technique.
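In practice the partial derivatives would be obtained for all network parameters at once by back-propagation through a framework's automatic differentiation. Purely as an illustration of the quantity being computed, a central finite difference approximates the same derivative for a single parameter; the model here being an arbitrary differentiable Python function is an assumption of the sketch.

```python
def sensitivity(model, params, index, h=1e-6):
    """Approximate d(output)/d(params[index]) by central difference.

    A back-propagation pass would yield this derivative for every
    parameter simultaneously and far more cheaply; the finite
    difference is for illustration only.
    """
    p_hi = list(params); p_hi[index] += h
    p_lo = list(params); p_lo[index] -= h
    return (model(p_hi) - model(p_lo)) / (2 * h)
```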
The model of the DNN may be a floating point model of the DNN.
The output of the DNN may comprise a single value and the one or more values based on the output of the DNN may comprise the single output value.
The output of the DNN may comprise a plurality of values and the one or more values based on the output of the DNN may comprise each of the plurality of output values.
The output of the DNN may comprise a plurality of values and the one or more values based on the output of the DNN may comprise a single summary value based on the plurality of output values.
The summary value may be a sum of the plurality of output values.
The summary value may be a maximum of the plurality of output values.
Generating the estimate of the error in the output of the DNN caused by quantisation of the set of network parameters may comprise calculating a weighted sum of the quantisation errors wherein the weight associated with a quantisation error for a network parameter is the sensitivity of the DNN with respect to that network parameter.
Generating the estimate of the error in the output of the DNN caused by quantisation of the set of network parameters may comprise calculating an absolute value of a weighted sum of the quantisation errors wherein the weight associated with a quantisation error for a network parameter is the sensitivity of the DNN with respect to that network parameter.
Generating the estimate of the error in the output of the DNN caused by quantisation of the set of network parameters may comprise: (i) calculating, for each network parameter in the set, the absolute value of the product of the quantisation error for that network parameter and the sensitivity of the DNN with respect to that network parameter; and (ii) calculating a sum of the absolute values.
Generating the estimate of the error in the output of the DNN caused by quantisation of the set of network parameters may comprise: (i) calculating, for each network parameter, the square of the quantisation error for that network parameter; (ii) calculating, for each network parameter, the product of the square of the quantisation error for that network parameter, and the absolute value of the sensitivity of the DNN with respect to that network parameter; and (iii) calculating a sum of the products.
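The four estimate variants described above may be sketched as follows; the function names are hypothetical.

```python
def weighted_sum(sens, qerrs):
    """Sum of quantisation errors weighted by sensitivity."""
    return sum(s * e for s, e in zip(sens, qerrs))

def abs_weighted_sum(sens, qerrs):
    """Absolute value of the sensitivity-weighted sum."""
    return abs(weighted_sum(sens, qerrs))

def sum_abs_products(sens, qerrs):
    """Sum of the absolute values of sensitivity * error products."""
    return sum(abs(s * e) for s, e in zip(sens, qerrs))

def weighted_sum_squares(sens, qerrs):
    """Sum of squared errors weighted by absolute sensitivity."""
    return sum(abs(s) * e * e for s, e in zip(sens, qerrs))
```

Note that the first two variants allow errors of opposite sign to cancel, whereas the last two do not.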
Each candidate number format may be defined by a bit width and an exponent.
The plurality of candidate number formats may have the same bit width and different exponents.
Each candidate number format may be defined by a bit width. At least two of the candidate number formats may have different bit widths. The local error may be further based on a size parameter.
The size parameter may be based on a number of bits to represent the network parameters in the set when the network parameters in the set are quantised in accordance with the candidate number format.
The set of network parameters may be one of: all or a portion of input data values for a layer of the DNN; all or a portion of weights for a layer of the DNN; all or a portion of biases of a layer of the DNN; and all or a portion of output data values of a layer of the DNN.
The method may further comprise configuring hardware logic to implement the DNN using the selected number format by configuring the hardware logic to receive and process the set of network parameters in accordance with the selected number format.
The local error may be the estimated error or a combination of the estimated error and a size parameter, the size parameter reflecting a size of the network parameters in the set of network parameters when quantised in accordance with the candidate number format.
A second aspect provides a method of determining number formats for representing network parameters of a Deep Neural Network “DNN” for use in configuring hardware logic to implement the DNN, the method comprising: dividing the network parameters of the DNN into a plurality of sets of network parameters, each set comprising two or more network parameters; and executing the method of the first aspect for each set of network parameters.
Each set of network parameters may comprise all or a portion of input data values to a layer of the DNN; all or a portion of biases to a layer of the DNN; or all or a portion of weights to a layer of the DNN.
A third aspect provides a computing-based device for determining a number format for representing a set of two or more network parameters of a Deep Neural Network “DNN” for use in configuring hardware logic to implement the DNN, the computing-based device comprising: at least one processor; and memory coupled to the at least one processor, the memory comprising computer readable code that when executed by the at least one processor causes the at least one processor to: determine a sensitivity of the DNN with respect to each network parameter in the set of network parameters; for each candidate number format of a plurality of candidate number formats: determine a quantisation error associated with quantising each network parameter in the set of network parameters in accordance with the candidate number format; generate an estimate of an error in an output of the DNN caused by quantisation of the set of network parameters based on the sensitivities and the quantisation errors; generate a local error based on the estimated error; and select the candidate number format with the minimum local error as the number format for the set of network parameters.
The hardware logic configurable to implement a DNN (e.g. DNN accelerator) may be embodied in hardware on an integrated circuit. There may be provided a method of manufacturing, at an integrated circuit manufacturing system, the hardware logic configurable to implement a DNN (e.g. DNN accelerator). There may be provided an integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, configures the system to manufacture the hardware logic configurable to implement a DNN (e.g. DNN accelerator). There may be provided a non-transitory computer readable storage medium having stored thereon a computer readable description of hardware logic configurable to implement a DNN (e.g. DNN accelerator) that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying hardware logic configurable to implement a DNN (e.g. DNN accelerator).
There may be provided an integrated circuit manufacturing system comprising: a non-transitory computer readable storage medium having stored thereon a computer readable description of hardware logic configurable to implement a DNN (e.g. DNN accelerator); a layout processing system configured to process the computer readable description so as to generate a circuit layout description of an integrated circuit embodying the hardware logic configurable to implement a DNN (e.g. DNN accelerator); and an integrated circuit generation system configured to manufacture the hardware logic configurable to implement a DNN (e.g. DNN accelerator) according to the circuit layout description.
There may be provided computer program code for performing a method as described herein. There may be provided a non-transitory computer readable storage medium having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform the methods as described herein.
The above features may be combined as appropriate, as would be apparent to a skilled person, and may be combined with any of the aspects of the examples described herein.
Examples will now be described in detail with reference to the accompanying drawings in which:
The accompanying drawings illustrate various examples. The skilled person will appreciate that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the drawings represent one example of the boundaries. It may be that in some examples, one element may be designed as multiple elements or that multiple elements may be designed as one element. Common reference numerals are used throughout the figures, where appropriate, to indicate similar features.
The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The present invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art. Embodiments are described by way of example only.
As described above, while representing the network parameters of a DNN in a floating point number format may allow more accurate or precise output data to be produced by the DNN, processing network parameters in a floating point number format in hardware is complex which tends to increase the silicon area, power consumption, memory and bandwidth consumption, and complexity of the hardware compared to hardware that processes network parameters in other formats, such as, but not limited to, fixed point number formats. Accordingly, hardware for implementing a DNN, such as a DNN accelerator, may be configured to represent and process the network parameters of a DNN in another number format, such as a fixed point number format, to reduce the area, power consumption, memory and bandwidth consumption, and complexity of the hardware logic.
There are a plurality of different types of number formats. Each number format type defines the parameters that form a number format of that type and how the parameters are interpreted. For example, one example number format type may specify that a number or value is represented by a b-bit mantissa m and an exponent exp and the number is equal to m*2^exp. As described in more detail below, some number format types can have configurable parameters, which may also be referred to as quantisation parameters, that can vary between number formats of that type. For example, in the example number format described above the bit width b and the exponent exp may be configurable. Accordingly, a first number format of that type may use a bit width b of 4 and an exponent exp of 6, and a second, different, number format of that type may use a bit width b of 8 and an exponent exp of −3.
Generally, the fewer bits that are used to represent the network parameters of a DNN (e.g. input data values, weights, biases, and output data values), the more efficiently the DNN can be implemented in hardware. However, typically the fewer bits that are used to represent the network parameters of a DNN, the less accurate the DNN becomes. Accordingly, it is desirable to identify number formats for representing the network parameters of the DNN that balance the number of bits used to represent the network parameters and the accuracy of the DNN.
The accuracy of a quantised DNN (i.e. a version of the DNN in which at least a portion of the network parameters are represented by a non-floating point number format) may be determined by comparing the output of such a DNN in response to input data to a baseline or target output. The baseline or target output may be the output of an unquantised version of the DNN (i.e. a version of the DNN in which all of the network parameters are represented by a floating point number format, which may also be referred to herein as a floating point version of the DNN or a floating point DNN) in response to the same input data or the ground truth output for the input data. The further the output of the quantised DNN is from the baseline or target output, the less accurate the quantised DNN. The size of a quantised DNN may be determined by the number of bits used to represent the network parameters of the DNN. Accordingly, the lower the bit depths of the number formats used to represent the network parameters of a DNN, the smaller the DNN.
While all the network parameters (e.g. input data values, weights, biases, and output data values) of a DNN may be represented using a single number format, this does not generally produce a DNN that is small in size and accurate. This is because different layers of a DNN tend to have different ranges of values. For example, one layer may have input data values between 0 and 6 whereas another layer may have input data values between 0 and 500. Accordingly, using a single number format may not allow either set of input data values to be represented efficiently or accurately. As a result, the network parameters of a DNN may be divided into sets of network parameters and a number format may be selected for each set. Preferably each set of network parameters comprises related or similar network parameters. As network parameters of the same type for the same layer tend to be related, each set of network parameters may be all or a portion of a particular type of network parameter for a layer. For example, each set of network parameters may be all or a portion of the input data values of a layer; all or a portion of the weights of a layer; all or a portion of the biases of a layer; or all or a portion of the output data values of a layer. Whether or not a set of network parameters comprises all, or only a portion, of the network parameters of a particular type for a layer may depend on the hardware that is to implement the DNN. For example, some hardware that can be used to implement a DNN may only support a single number format per network parameter type per layer, whereas other hardware that can be used to implement a DNN may support multiple number formats per network parameter type per layer.
Hardware for implementing a DNN, such as a DNN accelerator, may support one type of number format for the network parameters. For example, hardware for implementing a DNN may support number formats wherein numbers are represented by a b-bit mantissa and an exponent exp. To allow different sets of network parameters to be represented using different number formats, hardware for implementing a DNN may use a type of number format that has one or more configurable parameters, wherein the parameters are shared between all values in a set. These types of number formats may be referred to herein as block-configurable types of number formats or set-configurable types of number formats. Accordingly, non-configurable formats such as INT32 and floating point number formats are not block-configurable types of number formats. Example block-configurable types of number formats are described below.
One example block-configurable type of number format which may be used to represent the network parameters of a DNN is the Q-type format, which specifies a predetermined number of integer bits a and fractional bits b. Accordingly, a number can be represented as Qa.b which requires a total of a+b+1 bits (including the sign bit). Example Q-type formats are illustrated in Table 1 below. The quantisation parameters for the Q-type format are the number of integer bits a and the number of fractional bits b.
Another example block-configurable type of number format which may be used to represent network parameters of a DNN is one in which number formats of this type are defined by a fixed integer exponent exp and a b-bit mantissa m such that a value u is equal to u=2^exp*m. In some cases, the mantissa m may be represented in two's complement format. However, in other cases other signed or unsigned integer formats may be used. In these cases, the exponent exp and the number of mantissa bits b only need to be stored once for a set of values represented in that number format. Different number formats of this type may have different mantissa bit lengths b and/or different exponents exp, thus the quantisation parameters for this type of number format comprise the mantissa bit length b (which may also be referred to herein as a bit width, bit depth or bit length), and the exponent exp.
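Quantising a value to a number format of this type may be sketched as follows. Round-to-nearest and saturation at the ends of the representable range are assumptions of the sketch, since the rounding and clamping behaviour is not specified above.

```python
def quantise(value, b, exp):
    """Quantise to the format u = m * 2**exp, where m is a b-bit
    two's complement integer mantissa. Values outside the
    representable range are clamped (saturated)."""
    m = round(value / 2 ** exp)                       # round-to-nearest mantissa
    m_min, m_max = -(2 ** (b - 1)), 2 ** (b - 1) - 1  # two's complement range
    m = max(m_min, min(m_max, m))                     # saturate
    return m * 2 ** exp
```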
A final example block-configurable type of number format which may be used to represent the network parameters of a DNN is the 8-bit asymmetric fixed point (Q8A) type format. In one example, number formats of this type comprise a minimum representable number rmin, a maximum representable number rmax, a zero point z, and an 8-bit number dQ8A for each value in a set which identifies a linear interpolation factor between the minimum and maximum representable numbers. In other cases, a variant of this type of format may be used in which the number of bits used to store the interpolation factor dQbA is variable (e.g. the number of bits b used to store the interpolation factor may be one of a plurality of possible integers). In this example, the Q8A type format or a variant of the Q8A type format may approximate a floating point value dfloat as shown in equation (1) where b is the number of bits used by the quantised representation (i.e. 8 for the Q8A format) and z is the quantised zero point which will always map exactly back to 0. The quantisation parameters for this example type of number format comprise the maximum representable number or value rmax, the minimum representable number or value rmin, the quantised zero point z, and optionally, the mantissa bit length b (i.e. when the bit length is not fixed at 8).
In another example, the Q8A type format comprises a zero point z which will always map exactly to 0, a scale factor scale and an 8-bit number dQ8A for each value in the set. In this example a number format of this type approximates a floating point value dfloat as shown in equation (2). Similar to the first example Q8A type format, in other cases the number of bits for the integer or mantissa component may be variable. The quantisation parameters for this example type of number format comprise the zero point z, the scale scale, and optionally, the mantissa bit length b.
dfloat=(dQ8A−z)*scale  (2)
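Equation (2), together with a matching quantisation step, may be illustrated as follows. The quantisation direction, with round-to-nearest and clamping to the unsigned b-bit range, is an assumption of the sketch.

```python
def quantise_q8a(d_float, z, scale, b=8):
    """Map a float to an unsigned b-bit code, inverting equation (2):
    d_float = (d_Q8A - z) * scale, so d_Q8A = round(d_float/scale) + z."""
    code = round(d_float / scale) + z
    return max(0, min(2 ** b - 1, code))  # clamp to the b-bit range

def dequantise_q8a(d_q8a, z, scale):
    """Equation (2): d_float = (d_Q8A - z) * scale.
    The zero point z maps exactly back to 0."""
    return (d_q8a - z) * scale
```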
Determining a number format of a specific block-configurable type of number format may be described as identifying the one or more quantisation parameters for the type of number format. For example, determining a number format of a number format type defined by a b-bit mantissa and an exponent exp may comprise identifying the bit width b of the mantissa and/or the exponent exp.
Several methods have been developed for identifying number formats for representing network parameters of a DNN. One simple method (which may be referred to herein as the full range method or the minimum/maximum method) for selecting a number format for representing a set of network parameters of a DNN may comprise selecting, for a given mantissa bit depth b (or a given exponent exp), the smallest exponent exp (or smallest mantissa bit depth b) that covers the range for the expected set of network parameters x for a layer. For example, for a given mantissa bit depth b, the exponent exp can be chosen in accordance with equation (3) such that the number format covers the entire range of x where ┌.┐ is the ceiling function:
exp=┌log2(max(|x|))┐−b+1 (3)
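Equation (3) may be sketched as follows for a set of values x and a given mantissa bit depth b.

```python
import math

def full_range_exponent(x, b):
    """exp = ceil(log2(max|x|)) - b + 1 per equation (3): the smallest
    exponent whose b-bit mantissa range covers every value in x."""
    max_abs = max(abs(v) for v in x)
    return math.ceil(math.log2(max_abs)) - b + 1
```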
However, such a method is sensitive to outliers. Specifically, where the set of network parameters x has outliers, precision is sacrificed to cover the outliers. This may result in large quantisation errors (e.g. the error between the set of network parameters in a first number format (e.g. floating point number format) and the set of network parameters in the selected number format). As a consequence, the error in the output data of the layer and/or of the DNN caused by the quantisation may be greater than if the number format covered a smaller range, but with more precision.
Another method (which may be referred to as the weighted outlier method) is described in the Applicant's GB Patent Application No. 1718293.2, which is herein incorporated by reference in its entirety. In the weighted outlier method the number format for a set of network parameters is selected from a plurality of potential number formats based on the weighted sum of the quantisation errors when a particular number format is used, wherein a constant weight is applied to the quantisation errors for network parameters that fall within the representable range of the number format and a linearly increasing weight is applied to the quantisation errors for the values that fall outside the representable range.
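Purely as a hedged sketch of the idea just described (the actual weighting scheme is defined in the referenced application; the constant weight, the slope, and the cost form here are assumptions):

```python
def weighted_outlier_cost(x, quantised, r_min, r_max, w_in=1.0, slope=1.0):
    """Hypothetical sketch: a constant weight w_in on quantisation
    errors for values inside the representable range [r_min, r_max],
    and a linearly increasing weight for values outside it."""
    cost = 0.0
    for v, q in zip(x, quantised):
        err = abs(v - q)
        if r_min <= v <= r_max:
            cost += w_in * err
        else:
            overshoot = max(v - r_max, r_min - v)  # distance outside range
            cost += (w_in + slope * overshoot) * err
    return cost
```

The number format minimising such a cost over the candidate formats would then be selected.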
Yet another method (which may be referred to as the back-propagation method) is described in the Applicant's GB Patent Application No. 1821150.8, which is herein incorporated by reference in its entirety. In the back-propagation method the quantisation parameters that produce the best cost (e.g. a combination of DNN accuracy and DNN size (e.g. number of bits)) are selected by iteratively determining the gradient of the cost with respect to each quantisation parameter using back-propagation, and adjusting the quantisation parameters until the cost converges. This method can produce good results (e.g. a DNN that is small in size (in terms of number of bits), but is accurate), however it can take a long time to converge.
Finally, another method (which may be referred to as the end-to-end method) is described in the Applicant's GB Patent Application No. 1718289.0, which is herein incorporated by reference in its entirety. In the end-to-end method the number formats for the network parameters of a DNN are selected one layer at a time according to a predetermined sequence wherein any layer is preceded in the sequence by the layer(s) on which it depends. The number format for a set of network parameters for a layer is selected from a plurality of possible number formats based on the error in the output of the DNN when each of the plurality of possible number formats is used to represent the set of network parameters. Once the number format(s) for a layer has/have been selected, any calculation of the error in the output of the DNN for a subsequent layer in the sequence is based on the network parameters of that layer being represented using the selected number format(s). This may be quicker than the back-propagation method (e.g. it may produce a set of number formats for a DNN faster), but it is not quite as accurate, although it is more accurate than the minimum/maximum method and the weighted outlier method.
These methods can be divided into two groups: those, such as the minimum/maximum method and the weighted outlier method, that are easy to implement and can identify a set of number formats for the network parameters of a DNN quickly, but may provide sub-optimal results in terms of size and accuracy; and those, such as the back-propagation method and the end-to-end method, that are more complex to implement and take more time to identify a set of number formats for the network parameters of a DNN, but produce a better DNN (e.g. a DNN that is small in size but accurate). Accordingly, there is a need for a method of selecting number formats for the network parameters of a DNN that can produce a set of number formats quickly, but can also produce a good DNN (e.g. a DNN that is small in size but accurate).
Accordingly, described herein are methods and systems for identifying a number format for representing a set of network parameters of a DNN wherein the number format is selected as the candidate number format of a plurality of candidate number formats that minimizes a local error. The local error is based on an estimate of the error in the output of the DNN caused by quantisation of the set of network parameters, wherein the estimate of the error in the output of the DNN caused by the quantisation of the set of network parameters is based on the quantisation error of each network parameter in the set and the sensitivity of the DNN to each of the network parameters in the set. As described in more detail below, the sensitivity of the DNN to a particular network parameter indicates the importance, influence, or significance of the particular network parameter to the output of the DNN, and is therefore an indication of how much a perturbation of a particular network parameter is likely to affect the error in the output of the DNN.
Estimating the error in the output of the DNN caused by, or attributed to, quantisation of a set of network parameters based on sensitivity and quantisation error has proved to be an accurate method of estimating the error. In particular, in general the higher the magnitude of the quantisation error for a network parameter the greater the error in the output of the DNN (and thus the poorer the accuracy of the DNN). However, not all network parameters contribute to the output equally. Specifically, some network parameters will have more effect on the output than other network parameters. Accordingly, estimating the error in the output of the DNN caused by, or attributed to, quantisation of a set of network parameters from both sensitivity and quantisation errors, instead of solely from the quantisation errors, can produce a more accurate estimate of the error associated with quantising the set of network parameters.
For example, reference is now made to
Accordingly it can be seen from
Furthermore, estimating the error in the output of the DNN caused by quantisation of the set of network parameters based on quantisation error and sensitivity means that, unlike the end-to-end and back-propagation methods, the output of the DNN does not have to be determined or evaluated multiple times. Specifically, the sensitivity can be determined from a single forward pass of the DNN and, as described in more detail below, from a single backward pass. Accordingly, the described methods allow number formats to be identified quickly and efficiently.
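By way of illustration only, the idea of weighting each quantisation error by the corresponding sensitivity can be sketched as follows (the numeric values are hypothetical; this is a sketch of the principle, not the full method):

```python
def output_error_estimate(quant_errors, sensitivities):
    """Estimate the error in the DNN output as a sensitivity-weighted
    sum of the per-parameter quantisation errors."""
    return sum(e * s for e, s in zip(quant_errors, sensitivities))

# Two parameters with equal quantisation error but very different
# sensitivity contribute very differently to the estimated output error.
e = [0.1, 0.1]    # same quantisation error for both parameters
s = [5.0, 0.01]   # first parameter is far more influential
assert abs(output_error_estimate(e, s) - 0.501) < 1e-9
```

An estimate based on the quantisation errors alone would treat both parameters identically; the weighted estimate captures that perturbing the first parameter matters far more.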
Error Estimated from Sensitivity and Quantisation Error
An explanation will now be provided as to why the error in the output of the DNN related to quantisation of a set of network parameters to a particular number format can be accurately estimated using the sensitivity of the DNN with respect to the network parameters in the set and the quantisation error associated with quantising the network parameters to the particular number format. Specifically, without loss of generality, let a differentiable function ƒ(x) represent the DNN. By the first order Taylor series expansion an approximation of the output of the function after a small perturbation Δx of the input x is given by equation (4):

ƒ(x+Δx)≈ƒ(x)+(∂ƒ(x)/∂x)Δx (4)
Rearranging equation (4), an approximation of the size or magnitude of the perturbation in the output is given by equation (5):

ƒ(x+Δx)−ƒ(x)≈(∂ƒ(x)/∂x)Δx (5)
Where the function is a function of multiple variables with multiple outputs equation (5) becomes equation (6) where the total size of the perturbation in the jth output related to a perturbation of a set of variables is given by the sum of the perturbations in the jth output caused by each variable xi:

Δƒj≈Σi(∂ƒj/∂xi)Δxi (6)
As is known to those of skill in the art (and described in more detail below), quantisation rounds a network parameter x in a first number format to a representable number q (x, F) of another number format F. The number format F is defined by one or more quantisation parameters. As described above, different types of number formats may be defined by different quantisation parameters. For example, as described above, a Q-type format is defined by the number of integer bits and the number of fractional bits; and another format type may be defined by an exponent exp and a bit width b. Quantisation introduces an error between the original network parameter x and the quantised network parameter q (x, F) which can be considered a perturbation of the original value as shown in equation (7):
Δx=q(x,F)−x (7)
Then from equation (6) the estimate of the error in the jth output of a DNN caused by the quantisation of a set of N network parameters can be written as shown in equation (8):

Δƒj≈Σi=1N(∂ƒj/∂xi)(q(xi,F)−xi) (8)
Accordingly an estimate of the error in the jth output of a DNN caused by the quantisation of a set of network parameters to a number format F can be determined from (i) the quantisation error (q(xi, F)−xi) associated with quantising each of the network parameters in the set to the number format; and (ii) the partial derivative ∂ƒj/∂xi of the jth output with respect to each of the values xi in the set.
The partial derivative of a function with respect to a variable or value may also be referred to as the gradient of the function with respect to the variable or value.
The total error in the output caused by quantisation of a set of network parameters may then be estimated as the sum of the error in each output caused by the quantisation of a set of network parameters as shown in equation (9):

ΣjΔƒj≈ΣjΣi=1N(∂ƒj/∂xi)(q(xi,F)−xi) (9)
Calculating the partial derivatives in equation (9) for each value in a set amounts to calculating the Jacobian matrix J of the function ƒ, which is shown in equation (10) in terms of its entries:

Jj,i=∂ƒj/∂xi (10)

As is known to those of skill in the art, the Jacobian matrix of a function of multiple variables with multiple outputs is the matrix of all its first order partial derivatives. In some cases, it may be difficult to efficiently calculate the full Jacobian matrix due to its computation and memory requirements, particularly for DNNs with a large number of outputs (e.g. 1,000 outputs or more). Definitions of sensitivity analogous to those presented below may be based on an explicit calculation of the Jacobian matrix; however, for reasons of efficiency and practicality it is often preferable to summarise it in some manner.
One such method of avoiding computation of J is to rearrange equation (9) as shown in equation (11), and to define the sensitivity si of the network parameter xi with respect to the network outputs as the sum of partial derivatives as shown in equation (12):

ΣjΣi=1N(∂ƒj/∂xi)(q(xi,F)−xi)=Σi=1N(q(xi,F)−xi)Σj(∂ƒj/∂xi) (11)

si=Σj(∂ƒj/∂xi) (12)

This is an example of a use of a summary S of the network outputs; ƒ has here been summarised by the summation S=Σjƒj so that si=∂S/∂xi, which leads to equation (13):

G=Σi=1N(q(xi,F)−xi)si (13)
S may be defined in any suitable manner from the outputs fj of the DNN. In some cases, S may be the sum of the outputs of the DNN as shown in equation (14). The advantage of calculating S as set out in equation (14) is that S takes into account all of the outputs of the DNN. However, calculating S as set out in equation (14) may not work well, for example, where the outputs of the DNN (e.g. SoftMax outputs) are normalized such that all the outputs always sum to a constant. In such cases, the theoretical gradient of S is 0. Accordingly, in other cases, S may be the maximum of the outputs as shown in equation (15). This method of calculating S avoids the issue with normalized outputs that equation (14) has and has proven to produce good results (e.g. a DNN that is small in size, but accurate) for classification networks in particular. However, calculating S in accordance with equation (15) may not be suitable for DNNs where the output is not dominated by the largest output value, such as, but not limited to, image regression DNNs. It will be evident to a person of skill in the art that these are example methods of calculating S and that S may be calculated in any suitable manner from the outputs ƒj of the DNN.
S=Σj ƒj (14)
S=maxj ƒj (15)
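As an illustrative sketch, the two summary choices of equations (14) and (15) can be expressed as follows; the normalised (e.g. SoftMax-like) outputs used here are hypothetical:

```python
def summary_sum(outputs):
    # Equation (14): S as the sum of the DNN outputs
    return sum(outputs)

def summary_max(outputs):
    # Equation (15): S as the maximum of the DNN outputs
    return max(outputs)

softmax_like = [0.7, 0.2, 0.1]   # normalised outputs always sum to a constant
assert abs(summary_sum(softmax_like) - 1.0) < 1e-9   # constant S, zero gradient
assert summary_max(softmax_like) == 0.7
```

The first assertion illustrates the problem noted above: for normalised outputs the sum is constant, so its theoretical gradient with respect to every network parameter is zero, whereas the maximum still varies with the parameters.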
Accordingly, to minimise the error in the output of the DNN due to quantisation of a set of network parameters the best number format to quantise the set of network parameters can be selected as the number format that minimises a local error E that is based on an estimate G of the error in the output, wherein the estimate G is based on the quantisation error and sensitivity of the network parameters in the set. The local error E can be expressed as shown in equation (16) and the selection of the number format as the number format that minimises the local error E is expressed in equation (17):
E(F)=G(q(x,F)−x, s(x)) (16)
F*=argminF E(F) (17)
The estimate of the error G may be calculated in any suitable manner from the quantisation errors and the sensitivities. For example, the estimated error G may be calculated in accordance with equation (13) or a variant thereof. In some examples, the estimate of the error G may be calculated as the absolute value of the error estimate calculated in accordance with equation (13). In other words, G may be equal to the absolute value of the weighted sum of the quantisation errors, wherein the weight for a particular quantisation error is equal to the sensitivity of the DNN to the corresponding network parameter. This is expressed by equation (18). In other examples, the estimate of the error G may be calculated by (i) calculating, for each network parameter in the set, the absolute value of the product of the quantisation error for that network parameter and the sensitivity of the DNN with respect to that network parameter; and (ii) calculating the sum of the absolute values. This is expressed by equation (19). In yet other examples, the estimate of the error G may be calculated by (i) calculating, for each network parameter, the square of the quantisation error for that network parameter; (ii) calculating, for each network parameter, the product of the square of the quantisation error for that network parameter, and the absolute value of the sensitivity of the DNN with respect to that network parameter; and (iii) calculating the sum of the products. This is expressed in equation (20). Testing has shown that calculating the estimated error G in accordance with equation (20) works well for many DNNs. It will be evident to a person of skill in the art that these are examples only and the estimate of the error G may be calculated from the quantisation errors and sensitivities for the network parameters in any suitable manner.
G=|Σi=1N((q(xi,F)−xi)*si)| (18)

G=Σi=1N|(q(xi,F)−xi)*si| (19)

G=Σi=1N(q(xi,F)−xi)²*|si| (20)
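By way of an illustrative sketch, equations (18) to (20) can be written as follows (the error and sensitivity values are hypothetical):

```python
def G_abs_of_sum(errors, sens):
    # Equation (18): absolute value of the sensitivity-weighted sum
    return abs(sum(e * s for e, s in zip(errors, sens)))

def G_sum_of_abs(errors, sens):
    # Equation (19): sum of absolute weighted errors
    return sum(abs(e * s) for e, s in zip(errors, sens))

def G_squared(errors, sens):
    # Equation (20): squared errors weighted by |sensitivity|
    return sum((e ** 2) * abs(s) for e, s in zip(errors, sens))

e = [0.5, -0.5]
s = [1.0, 1.0]
assert G_abs_of_sum(e, s) == 0.0   # opposite-sign errors cancel in (18)
assert G_sum_of_abs(e, s) == 1.0   # (19) does not allow cancellation
assert G_squared(e, s) == 0.5
```

The example highlights a difference between the variants: equation (18) permits quantisation errors of opposite sign to cancel, whereas equations (19) and (20) do not.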
Where the bit width varies between candidate number formats the local error E may be modified to include an additional term that penalises number formats with large bit depths. For example, in some cases, as shown in equation (21) the local error E may be amended to include a size parameter B which reflects the size of the network parameters when using a particular candidate number format. For example, in some cases, B may be a positive value based on the number of bits to represent the network parameters when using a particular candidate number format. Since the quantisation error, and thus the estimated error G, can always be reduced by increasing the bit width, the number format that produces the best, or minimum, G will typically be the number format with the largest bit width. However, larger bit widths increase the size of the DNN which increases the costs to implement the DNN. Accordingly, by adding the additional term to the local error E that penalises large bit depths a number format that balances size and accuracy will be selected.
E(F)=G(q(x,F)−x, s(x))+B(F) (21)
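A minimal sketch of equation (21) follows. The linear penalty lam * bit_width is an assumed example form for B(F); the text only requires that B penalise formats with large bit widths:

```python
def local_error(G, bit_width, lam=0.01):
    """Equation (21): estimated output error G plus a size term B(F).
    The linear form lam * bit_width is an illustrative assumption."""
    return G + lam * bit_width

# A wider format with a slightly smaller G can now lose to a narrower one,
# so the selection balances accuracy against size.
assert local_error(0.10, 8) < local_error(0.09, 16)
assert abs(local_error(0.0, 8) - 0.08) < 1e-9
```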
Reference is now made to
As described above with respect to
The method 800 begins at block 802 where the sensitivity of the DNN with respect to each of the network parameters in the set is determined. As described above, the sensitivity of the DNN with respect to a network parameter is a measure of the importance, significance, or relevance of a network parameter to the output of the DNN. In some cases, determining the sensitivity of the DNN with respect to each of the network parameters may comprise determining the output of a model of the DNN in response to input data; determining the partial derivative of one or more values based on the output of the DNN with respect to each of the network parameters in the set; and calculating the sensitivity for each network parameter based on the partial derivative(s) for that network parameter.
A model of a DNN is a representation of the DNN that can be used to determine the output of the DNN in response to input data. The model may be, for example, a software implementation of the DNN or a hardware implementation of the DNN. As shown in
In some cases, the model may be a floating point model of the DNN (i.e. a model of the DNN in which the network parameters of the DNN are represented using floating point number formats). Since values can generally be represented more accurately, or more precisely, in a floating point number format a floating point model of the DNN represents a model of the DNN that will produce the most accurate output. Accordingly, the output generated by a floating point model of the DNN may be used to determine the sensitivity of the DNN to each of the network parameters.
In some cases, the output of a DNN may comprise a single value ƒ. In these cases the partial derivative of the output with respect to each of the network parameters in the set may be calculated and the partial derivative for a network parameter may be used as the sensitivity of the DNN with respect to that network parameter. For example, where a DNN produces a single output ƒ and there are three network parameters in the set x1, x2 and x3 then ∂ƒ/∂x1, ∂ƒ/∂x2 and ∂ƒ/∂x3 are calculated and ∂ƒ/∂x1 is used as the sensitivity of the DNN with respect to x1 (i.e. s1=∂ƒ/∂x1), ∂ƒ/∂x2 is used as the sensitivity of the DNN with respect to x2 (i.e. s2=∂ƒ/∂x2), and ∂ƒ/∂x3 is used as the sensitivity of the DNN with respect to x3 (i.e. s3=∂ƒ/∂x3).
In other cases, the output of the DNN (such as a classification DNN) may comprise multiple values ƒ1, ƒ2, . . . ƒM. In these cases, the partial derivative of each of the outputs with respect to each of the network parameters in the set may be calculated (e.g. a Jacobian matrix may be calculated) and the sensitivity for a network parameter may be a combination of the partial derivatives for the network parameter. For example, the sensitivity of the DNN with respect to a particular network parameter may be calculated as the sum of the partial derivatives as set out in equation (12). Alternatively a single value S, which may be referred to as the representative output value or the summary value, may be generated from the plurality of output values ƒ1, ƒ2, . . . ƒM and the partial derivative ∂S/∂xi of the representative output value S with respect to each of the network parameters may be calculated.
The partial derivative for a network parameter may be used as the sensitivity of the DNN with respect to the network parameter. The representative output value, or the summary value, S may be calculated from the plurality of output values in any suitable manner. For example, the representative output value S may be equal to the sum of the outputs as set out in equation (14) or the representative output value S may be the maximum of the outputs as set out in equation (15).
In some cases, the partial derivatives may be calculated using back-propagation. As is known to those of skill in the art, back-propagation (which may also be referred to as backward propagation of errors) is a technique that may be used as part of an optimisation algorithm to train a DNN. Training a DNN comprises identifying the appropriate weights to configure the DNN to perform a specific function. Back-propagation works by computing the partial derivative of an error function with respect to a network parameter by the chain rule, computing the gradient one layer at a time, iterating backwards from the last layer.
The partial derivative of an output, or a representative/summary value for an output or set of outputs, with respect to any network parameter can be generated via back-propagation. For example,
The magnitude of the gradient of an output ƒ, or a representative/summary S of an output or a set of outputs, with respect to a particular network parameter xi (e.g. |∂ƒ/∂xi| or |∂S/∂xi|) indicates whether quantisation of the network parameter will have a significant impact on the output of the DNN. Specifically, the higher the magnitude of the gradient, the greater the effect the quantisation of the network parameter has on the output(s); and the lower the magnitude of the gradient, the less effect the quantisation of the network parameter has on the output(s). As shown in
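For illustration only, the gradients ∂S/∂xi that serve as sensitivities can be approximated by central finite differences on a toy, hypothetical "network" with a single summary output S; a real implementation would obtain the same quantities from a single backward pass:

```python
def sensitivities(f, params, h=1e-5):
    """Approximate s_i = dS/dx_i by central finite differences."""
    s = []
    for i in range(len(params)):
        up = list(params); up[i] += h
        dn = list(params); dn[i] -= h
        s.append((f(up) - f(dn)) / (2 * h))  # central difference
    return s

# Toy single-output "network": S = 3*x0 + 0.1*x1 (hypothetical)
S = lambda x: 3.0 * x[0] + 0.1 * x[1]
s = sensitivities(S, [1.0, 1.0])
assert abs(s[0] - 3.0) < 1e-4 and abs(s[1] - 0.1) < 1e-4
```

The first parameter, with the larger gradient magnitude, is the one whose quantisation would most affect the output.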
Once the sensitivity of the DNN with respect to each network parameter in the set has been determined the method 800 proceeds to block 804.
At block 804, for each candidate number format of a plurality of candidate number formats, the quantisation error associated with quantising each network parameter in the set in accordance with that candidate number format is determined. In some cases, the plurality of candidate number formats may comprise all possible candidate number formats of a particular type of number format. For example, if a number format type is defined by an exponent exp and a bit width b and the exponent exp can be 0 or 1 and the bit width b can be 2, 3 or 4 then the candidate number formats may comprise all possible combinations of exponents exp and bits widths b—e.g. a number format defined by exponent of 0 and a bit width of 2, a number format defined by an exponent of 0 and a bit width of 3, a number format defined by an exponent of 0 and a bit width of 4, a number format defined by an exponent of 1 and a bit width of 2, a number format defined by an exponent of 1 and a bit width of 3, and a number format defined by an exponent of 1 and a bit width of 4.
In other cases, the candidate number formats may comprise only a subset of the possible number formats of a particular number format type. For example, in some cases, all of the candidate number formats may have the same value for one quantisation parameter and different values for another quantisation parameter. In this way the method 800 may be used to select the value for one of the quantisation parameters. The value(s) for the other quantisation parameter(s) may be selected in any suitable manner. For example, where the number formats are defined by an exponent exp and a bit width b the candidate number formats may all have the same bit width b but different exponents exp; or the candidate number formats may all have the same exponent exp but different bit widths b. In some cases, the candidate number formats may be selected from the possible number formats using one or more criteria. For example, if one of the quantisation parameters is an exponent exp, the maximum/minimum method may be used to provide an upper bound on the exponent exp and the candidate number formats may only comprise number formats with exponents exp less than or equal to the upper bound. For example, if an exponent may be any integer from 1 to 5, and the upper bound as determined from, for example, the minimum/maximum method is 3, then the plurality of candidate number formats may comprise number formats with exponents of 1, 2 and 3 only.
For each of the possible candidate number formats a quantisation error is determined for each network parameter in the set. For example, if there are three network parameters in the set x0, x1 and x2, and there are four candidate number formats each defined by a bit width b and an exponent exp—F0 (b=8, exp=0), F1 (b=8, exp=1), F2 (b=8, exp=2) and F3 (b=8, exp=3)—each network parameter may be quantised four times, once in accordance with the first number format F0 defined by a bit width of 8 and an exponent of 0, once in accordance with the second number format F1 defined by a bit width of 8 and an exponent of 1, once in accordance with the third number format F2 defined by a bit width of 8 and an exponent of 2, and once in accordance with the fourth number format F3 defined by a bit width of 8 and an exponent of 3. The quantisation error ei,k associated with quantising each network parameter xi in accordance with each candidate number format Fk is then determined. Accordingly, each candidate number format is associated with three quantisation errors, one for each network parameter, as shown in Table 2.
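This step can be sketched as follows for three parameters and four candidate formats (b=8 with exponents 0 to 3); the parameter values and the simplified round-to-step quantiser (saturation omitted for brevity) are illustrative assumptions:

```python
def q(x, exp):
    step = 2.0 ** exp              # smallest representable step for exponent exp
    return round(x / step) * step  # rounding only; saturation omitted here

params = [3.2, 7.6, 12.5]          # hypothetical x0, x1, x2
# One row of quantisation errors e_{i,k} per candidate format F_k
errors = {k: [q(x, exp) - x for x in params] for k, exp in enumerate(range(4))}

# e.g. the errors for candidate format F0 (exp=0, step of 1)
assert all(abs(e - t) < 1e-9 for e, t in zip(errors[0], [-0.2, 0.4, -0.5]))
```

The resulting dictionary corresponds to the rows of Table 2: one quantisation error per network parameter per candidate format.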
As is known to those of skill in the art, quantisation is the process of converting a number in a higher precision number format to a lower precision number format. Quantising a number in a higher precision format to a lower precision format generally comprises selecting one of the representable numbers in the lower precision format to represent the number in the higher precision format based on a particular rounding mode (such as, but not limited to round to nearest (RTN), round to zero (RTZ), ties to even (RTE), round to positive infinity (RTP), and round to negative infinity (RTNI)).
For example, equation (22) sets out an example formula for quantising a value h in a first number format into a value q(h, F) in a second, lower precision, number format F where Xmax is the highest representable number in the second number format, Xmin is the lowest representable number in the second number format, and RND(h) is a rounding function:

q(h, F)=Xmax if h>Xmax; Xmin if h<Xmin; RND(h) otherwise (22)
The formula set out in equation (22) quantises a value h in a first number format to one of the representable numbers in the second number format F, wherein the representable number in the second number format F is selected based on the rounding mode RND (e.g. RTN, RTZ, RTE, RTP or RTNI).
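A sketch of equation (22) follows for a format assumed to be defined by a bit width b and an exponent exp, with round-to-nearest as RND; the two's complement range and the choice of rounding mode are illustrative assumptions:

```python
def quantise(h, exp, b):
    step = 2.0 ** exp
    x_min = -(2 ** (b - 1)) * step         # lowest representable number Xmin
    x_max = (2 ** (b - 1) - 1) * step      # highest representable number Xmax
    if h > x_max:
        return x_max                       # saturate to Xmax
    if h < x_min:
        return x_min                       # saturate to Xmin
    return round(h / step) * step          # RND(h): round to nearest

assert quantise(0.3, exp=-2, b=8) == 0.25
assert quantise(100.0, exp=-2, b=8) == 31.75    # clamps to Xmax
assert quantise(-100.0, exp=-2, b=8) == -32.0   # clamps to Xmin
```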
In the examples described herein, the lower precision format is a block-configurable type of number format and the higher precision format may be any number format (although it is often a floating point number format). In other words, each network parameter is initially in a first number format (e.g. a floating point number format), and is quantised to a lower precision block-configurable type number format.
In some cases, the quantisation error ei,k for a network parameter xi for a specific candidate number format Fk may be calculated as the difference between the initial network parameter xi in an initial format (e.g. in a floating point number format) and the initial network parameter quantised in accordance with the candidate number format q(xi, Fk) as shown in equation (23).
ei,k=q(xi, Fk)−xi (23)
Once a quantisation error ei,k has been determined for each network parameter for each candidate number format, the method 800 proceeds to block 806.
At block 806, for each candidate number format, an estimate of the error G in the output of the DNN caused by quantisation of the set of network parameters is generated based on the sensitivities si calculated in block 802 and the quantisation errors ei,k associated with that candidate number format calculated in block 804. Table 3 illustrates, for the example described above with respect to Table 2 where there are three network parameters in the set x0, x1, x2 and there are four candidate number formats F0, F1, F2, F3, the relevant quantisation errors ei,k and sensitivities si for generating the error estimate G for each candidate number format.
The estimate of the error G for a candidate number format may be generated in any suitable manner from the relevant quantisation errors and the sensitivities. In one example the estimate of the error G for a candidate number format may be calculated as the weighted sum of the relevant quantisation errors where the weight of a quantisation error for a network parameter is the sensitivity of the DNN for that network parameter. This is expressed in equation (13). In another example, the estimate of the error G may be calculated as the absolute value of the weighted sum of the quantisation errors, wherein the weight for a quantisation error for a network parameter is equal to the sensitivity of the DNN for that network parameter. This is expressed by equation (18). In another example, the estimate of the error G may be calculated by (i) calculating, for each network parameter in the set, the absolute value of the product of the quantisation error for that network parameter and the sensitivity of the DNN with respect to that network parameter; and (ii) calculating the sum of the absolute values. This is expressed in equation (19). In yet another example, the estimate of the error G may be calculated by (i) calculating, for each network parameter, the square of the quantisation error for that network parameter; (ii) calculating, for each network parameter, the product of the square of the quantisation error for that network parameter, and the absolute value of the sensitivity of the DNN with respect to that network parameter; and (iii) calculating the sum of the products. This is expressed in equation (20). As described above, testing has shown that calculating the estimated error G in accordance with equation (20) works well for many DNNs. It will be evident to a person of skill in the art that these are examples only and the estimate of the error G may be calculated from the quantisation errors and sensitivities in any suitable manner.
Once an estimate of the error G has been generated for each candidate number format the method 800 proceeds to block 808.
At block 808, for each candidate number format, a local error E is generated based on the corresponding error estimate G. In some cases (e.g. when the candidate number formats have the same bit depth) the local error may be equal to the error estimate G. In other cases, the local error E may be a combination of the estimated error G and one or more other parameters or terms. For example, as shown in equation (21), when the candidate number formats have different bit widths, the local error E may be amended to include a size parameter or term B which reflects the size of the network parameters when using a particular candidate number format. For example, in some cases, B may be a positive value based on the number of bits to represent the network parameters using the candidate number format. Since the quantisation error, and thus the estimated error G, can always be reduced by increasing the bit width, without the size term the number format that produces the best, or minimum, G will likely be the number format with the largest bit width. Accordingly, by adding the additional term to the local error E that penalises large bit depths a number format that balances size and accuracy will be selected.
Once the local error E has been generated for each candidate number format the method 800 proceeds to block 810.
At block 810, the candidate number format that has the lowest local error E is selected as the number format for the set of network parameters. For example, in the example described above with respect to Tables 2 and 3 where there are three network parameters x0, x1, x2 and four candidate number formats F0, F1, F2, F3 and the first candidate number format F0 has the smallest local error E then the first candidate number format F0 may be selected as the number format for the set of network parameters. After one of the candidate number formats has been selected based on the local errors E associated therewith the method 800 may end or the method 800 may proceed to block 812 and/or block 814.
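Pulling blocks 804 to 810 together, a self-contained sketch follows, assuming the sensitivities from block 802 are already available, equation (20) as the error estimate G, candidate formats of equal bit width (so no size term is needed), and an assumed fixed-point quantiser:

```python
def quantise(h, exp, bits=8):
    # Assumed fixed-point quantiser: round to nearest, saturate to range
    step = 2.0 ** exp
    lo, hi = -(2 ** (bits - 1)) * step, (2 ** (bits - 1) - 1) * step
    return min(max(round(h / step) * step, lo), hi)

def select_format(params, sens, candidate_exps):
    # Blocks 804-808: local error per candidate using equation (20);
    # block 810: pick the candidate with the lowest local error.
    def local_error(exp):
        return sum(((quantise(x, exp) - x) ** 2) * abs(s)
                   for x, s in zip(params, sens))
    return min(candidate_exps, key=local_error)

# With equal bit widths and all values in range, the finest exponent wins.
assert select_format([0.3, -1.7, 0.9], [2.0, 0.5, 1.0], [-4, -2, 0, 2]) == -4
```

When the candidate formats have different bit widths, a size term as in equation (21) would be added inside local_error.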
At block 812, the selected number format is output for use in configuring hardware logic (e.g. DNN accelerator) to implement the DNN. The selected number format may be output in any suitable manner. Once the selected number format has been output the method 800 may end or the method 800 may proceed to block 814.
At block 814, hardware logic capable of implementing a DNN is configured to implement the DNN using the number format selected in block 810. Configuring hardware logic to implement a DNN may generally comprise configuring the hardware logic to process inputs to each layer of the DNN in accordance with that layer and provide the output of that layer to a subsequent layer or provide the output as the output of the DNN. For example, if a DNN comprises a first convolution layer and a second normalisation layer, configuring hardware logic to implement such a DNN comprises configuring the hardware logic to receive inputs to the DNN and process the inputs in accordance with the weights of the convolution layer, process the outputs of the convolution layer in accordance with the normalisation layer, and then output the outputs of the normalisation layer as the outputs of the DNN. Configuring hardware logic to implement a DNN using the number format selected in block 810 may comprise configuring the hardware logic to receive and process the set of network parameters in accordance with the selected number format. For example, if the selected number format for a set of network parameters is defined by a bit-width of 6 and an exponent of 4 then the hardware logic to implement the DNN may be configured to interpret and process the network parameters in the set on the basis that they are in a number format defined by a bit width of 6 and an exponent of 4.
In some cases, the method 800 of
At block 1304 one of the sets of network parameters is selected. Then blocks 802 to 810 of the method 800 of
Although in the method 1300 of
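Structurally, method 1300 repeats the per-set selection (blocks 802 to 810 of method 800) for each set of network parameters, e.g. one set per layer. A sketch follows in which the per-set selection rule is a toy stand-in (a simple range-covering rule, not the sensitivity-based selection):

```python
def select_formats(param_sets, choose_format):
    """Apply the per-set format selection to each set in turn."""
    return [choose_format(params) for params in param_sets]

def choose_format(params):
    # Toy stand-in: smallest exponent whose range covers the set (b=8 assumed)
    exp = 0
    while (2 ** 7 - 1) * 2.0 ** exp < max(abs(v) for v in params):
        exp += 1
    return exp

assert select_formats([[0.4, 3.3], [300.0, -2.2]], choose_format) == [0, 2]
```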
Table 4 shows the Top-1 and Top-5 classification accuracy of different classification neural networks, evaluated on the ImageNet validation set of 50,000 labelled images, when number formats defined by an exponent and bit width are used for each network parameter type for each layer and the exponent is selected in accordance with the minimum/maximum method, the weighted outlier method, the end-to-end method and the method 1300 of
It can be seen from Table 4 that the method 1300 of
Reference is now made to
The DNN accelerator 1400 of
The example DNN accelerator 1400 of
The input logic 1401 is configured to receive the input data to be processed and provides it to a downstream logic component for processing.
The convolution engine 1402 is configured to perform a convolution operation on the received input data using the weights associated with a particular convolution layer. The weights for each convolution layer of the DNN may be stored in a coefficient buffer 1416 as shown in
The convolution engine 1402 may comprise a plurality of multipliers (e.g. 128) and a plurality of adders which add the result of the multipliers to produce a single sum. Although a single convolution engine 1402 is shown in
The accumulation buffer 1404 is configured to receive the output of the convolution engine and add it to the current contents of the accumulation buffer 1404. In this manner, the accumulation buffer 1404 accumulates the results of the convolution engine 1402 over several hardware passes of the convolution engine 1402. Although a single accumulation buffer 1404 is shown in
The element-wise operations logic 1406 is configured to receive either the input data for the current hardware pass (e.g. when a convolution layer is not processed in the current hardware pass) or the accumulated result from the accumulation buffer 1404 (e.g. when a convolution layer is processed in the current hardware pass). The element-wise operations logic 1406 may either process the received input data or pass the received input data to other logic (e.g. the activation logic 1408 and/or the normalisation logic 1410) depending on whether an element-wise layer is processed in the current hardware pass and/or depending on whether an activation layer is to be processed prior to an element-wise layer. When the element-wise operations logic 1406 is configured to process the received input data the element-wise operations logic 1406 performs an element-wise operation on the received data (optionally with another data set (which may be obtained from external memory)). The element-wise operations logic 1406 may be configured to perform any suitable element-wise operation such as, but not limited to add, multiply, maximum, and minimum. The result of the element-wise operation is then provided to either the activation logic 1408 or the normalisation logic 1410 depending on whether an activation layer is to be processed subsequent the element-wise layer or not.
The activation logic 1408 is configured to receive one of the following as input data: the original input to the hardware pass (via the element-wise operations logic 1406) (e.g. when a convolution layer is not processed in the current hardware pass); the accumulated data (via the element-wise operations logic 1406) (e.g. when a convolution layer is processed in the current hardware pass and either an element-wise layer is not processed in the current hardware pass or an element-wise layer is processed in the current hardware pass but follows an activation layer). The activation logic 1408 is configured to apply an activation function to the input data and provide the output data back to the element-wise operations logic 1406 where it is forwarded to the normalisation logic 1410 directly or after the element-wise operations logic 1406 processes it. In some cases, the activation function that is applied to the data received by the activation logic 1408 may vary per activation layer. In these cases, information specifying one or more properties of an activation function to be applied for each activation layer may be stored (e.g. in memory) and the relevant information for the activation layer processed in a particular hardware pass may be provided to the activation logic 1408 during that hardware pass.
In some cases, the activation logic 1408 may be configured to store, in entries of a lookup table, data representing the activation function. In these cases, the input data may be used to lookup one or more entries in the lookup table and output values representing the output of the activation function. For example, the activation logic 1408 may be configured to calculate the output value by interpolating between two or more entries read from the lookup table.
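A lookup-table activation of this kind can be modelled compactly. The sketch below assumes a sorted one-dimensional table and linear interpolation between the two nearest entries, with inputs outside the table range clamped to the end entries (the hardware may interpolate between more than two entries):

```python
import bisect

def lut_activation(x, xs, ys):
    """Approximate an activation function stored as lookup-table
    samples: xs are the sorted sample inputs, ys the corresponding
    outputs.  The output is found by linearly interpolating between
    the two entries that bracket x."""
    if x <= xs[0]:
        return ys[0]            # clamp below the table range
    if x >= xs[-1]:
        return ys[-1]           # clamp above the table range
    i = bisect.bisect_right(xs, x)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

With ReLU samples `xs = [-1.0, 0.0, 1.0]`, `ys = [0.0, 0.0, 1.0]`, an input of `0.5` interpolates to `0.5` and an input of `-0.5` to `0.0`.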
In some examples, the activation logic 1408 may be configured to operate as a Rectified Linear Unit (ReLU) by implementing a ReLU function. In a ReLU function, the output element yi,j,k is calculated by identifying a maximum value as set out in equation (24) wherein for x values less than 0, y=0:
yi,j,k=ƒ(xi,j,k)=max{0, xi,j,k} (24)
In other examples, the activation logic 1408 may be configured to operate as a Parametric Rectified Linear Unit (PReLU) by implementing a PReLU function. The PReLU function performs a similar operation to the ReLU function. Specifically, where w1, w2, b1, b2 ∈ ℝ are constants, the PReLU is configured to generate an output element yi,j,k as set out in equation (25):
yi,j,k=ƒ(xi,j,k; w1, w2, b1, b2)=max{(w1*xi,j,k+b1), (w2*xi,j,k+b2)} (25)
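Equations (24) and (25) can be checked with a small model; the functions below are illustrative only, and show that the ReLU of equation (24) is the special case of equation (25) with w1=1, b1=0, w2=0, b2=0:

```python
def prelu(x, w1, w2, b1, b2):
    """Equation (25): the maximum of two affine functions of x."""
    return max(w1 * x + b1, w2 * x + b2)

def relu(x):
    """Equation (24), recovered as a special case of equation (25):
    with w1=1, b1=0, w2=0, b2=0 the maximum reduces to max(x, 0)."""
    return prelu(x, w1=1.0, w2=0.0, b1=0.0, b2=0.0)
```

Setting w2 to a small positive slope instead gives a leaky/parametric ReLU, e.g. `prelu(-4.0, 1.0, 0.1, 0.0, 0.0)` yields `-0.4`.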
The normalisation logic 1410 is configured to receive one of the following as input data: the original input data for the hardware pass (via the element-wise operations logic 1406) (e.g. when a convolution layer is not processed in the current hardware pass and neither an element-wise layer nor an activation layer is processed in the current hardware pass); the accumulation output (via the element-wise operations logic 1406) (e.g. when a convolution layer is processed in the current hardware pass and neither an element-wise layer nor an activation layer is processed in the current hardware pass); and the output data of the element-wise operations logic and/or the activation logic. The normalisation logic 1410 then performs a normalisation function on the received input data to produce normalised data. In some cases, the normalisation logic 1410 may be configured to perform a Local Response Normalisation (LRN) Function and/or a Local Contrast Normalisation (LCN) Function. However, it will be evident to a person of skill in the art that these are examples only and that the normalisation logic 1410 may be configured to implement any suitable normalisation function or functions. Different normalisation layers may be configured to apply different normalisation functions.
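As a concrete illustration, one widely used form of LRN (the across-channel variant) divides each value by a power of a local sum of squares. This sketch is only an example of a normalisation function, not necessarily the one the normalisation logic 1410 implements, and the parameter names k, alpha, beta and n are conventional rather than taken from the accelerator:

```python
def lrn(channels, k=2.0, alpha=1e-4, beta=0.75, n=5):
    """Across-channel LRN: each value is divided by
    (k + alpha * sum of squares over a window of n neighbouring
    channels) raised to the power beta.  `channels` holds one value
    per channel at a fixed spatial position."""
    out = []
    for i, a in enumerate(channels):
        lo = max(0, i - n // 2)             # window start (clamped)
        hi = min(len(channels), i + n // 2 + 1)  # window end (clamped)
        scale = (k + alpha * sum(c * c for c in channels[lo:hi])) ** beta
        out.append(a / scale)
    return out
```

For instance, with k=1, alpha=1, beta=1 and a window of one channel, an input value of 2.0 is normalised to 2/(1+4) = 0.4.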
The pooling logic 1412 may receive the normalised data from the normalisation logic 1410 or may receive the input data to the normalisation logic 1410 via the normalisation logic 1410. In some cases, data may be transferred between the normalisation logic 1410 and the pooling logic 1412 via an XBar 1418. The term “XBar” is used herein to refer to simple hardware logic containing routing logic which connects multiple logic components together in a dynamic fashion. In this example, the XBar may dynamically connect the normalisation logic 1410, the pooling logic 1412 and/or the output interleave logic 1414 depending on which layers will be processed in the current hardware pass. Accordingly, the XBar may receive information each pass indicating which logic components 1410, 1412, 1414 are to be connected.
The pooling logic 1412 is configured to perform a pooling function, such as, but not limited to, a max or mean function, on the received data to produce pooled data. The purpose of a pooling layer is to reduce the spatial size of the representation to reduce the number of parameters and computation in the network, and hence to also control overfitting. In some examples, the pooling operation is performed over a sliding window that is defined per pooling layer.
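A sliding-window pooling function can be illustrated in one dimension as follows; a real pooling layer operates over two spatial dimensions per channel, and the window size and stride are the per-layer parameters mentioned above:

```python
def max_pool_1d(data, window, stride):
    """Max pooling over a sliding window, 1-D illustration."""
    return [max(data[i:i + window])
            for i in range(0, len(data) - window + 1, stride)]

def mean_pool_1d(data, window, stride):
    """Mean pooling over a sliding window, 1-D illustration."""
    return [sum(data[i:i + window]) / window
            for i in range(0, len(data) - window + 1, stride)]
```

For example, `max_pool_1d([1, 3, 2, 5, 4, 6], window=2, stride=2)` yields `[3, 5, 6]`, halving the spatial size of the representation.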
The output interleave logic 1414 may receive the normalised data from the normalisation logic 1410, the input data to the normalisation function (via the normalisation logic 1410), or the pooled data from the pooling logic 1412. In some cases, the data may be transferred between the normalisation logic 1410, the pooling logic 1412 and the output interleave logic 1414 via an XBar 1418. The output interleave logic 1414 is configured to perform a rearrangement operation to produce data that is in a predetermined order. This may comprise sorting and/or transposing the received data. The data generated by the last of the layers is provided to the output logic 1415 where it is converted to the desired output format for the current hardware pass.
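The rearrangement performed by the output interleave logic 1414 can be illustrated by a simple transpose, one example of reordering data into a predetermined order:

```python
def transpose(rows):
    """Rearrange row-major data into column-major order; an
    illustrative example of the sort/transpose rearrangement
    described above, not the accelerator's actual ordering."""
    return [list(col) for col in zip(*rows)]
```

For example, `transpose([[1, 2, 3], [4, 5, 6]])` yields `[[1, 4], [2, 5], [3, 6]]`.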
The normalisation logic 1410, the pooling logic 1412, and the output interleave logic 1414 may each have access to a shared buffer 1420 which these logic components 1410, 1412 and 1414 can use to write data to, and retrieve data from. For example, the shared buffer 1420 may be used by these logic components 1410, 1412, 1414 to rearrange the order of the received data or the generated data; one or more of these logic components may be configured to write data to the shared buffer 1420 and read the same data out in a different order. In some cases, although each of the normalisation logic 1410, the pooling logic 1412 and the output interleave logic 1414 has access to the shared buffer 1420, each may be allotted a portion of the shared buffer 1420 which only it can access. In these cases, each of the normalisation logic 1410, the pooling logic 1412 and the output interleave logic 1414 may only be able to read data out of the shared buffer 1420 that it has written into the shared buffer 1420.
The logic components of the DNN accelerator 1400 that are used or active during any hardware pass are based on the layers that are processed during that hardware pass. In particular, only the logic components related to the layers processed during the current hardware pass are used or active. As described above, the layers that are processed during a particular hardware pass are determined (typically in advance, by, for example, a software tool) based on the order of the layers in the DNN and optionally one or more other factors (such as the size of the data). For example, in some cases the DNN accelerator may be configured to perform the processing of a single layer per hardware pass unless multiple layers can be processed without writing data to memory between layers. For example, if a first convolution layer is immediately followed by a second convolution layer, each of the convolution layers would have to be performed in a separate hardware pass, as the output data from the first convolution layer needs to be written out to memory before it can be used as an input to the second. In each of these hardware passes only the logic components, or engines, relevant to a convolution layer, such as the convolution engine 1402 and the accumulation buffer 1404, may be used or active.
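The grouping of layers into hardware passes might be sketched as follows. This is a simplified, hypothetical illustration of the convolution-after-convolution rule only; a real software tool would apply further rules, e.g. based on the size of the data:

```python
def split_into_passes(layers):
    """Group a layer sequence into hardware passes: start a new pass
    whenever a convolution layer would join a pass that already
    contains one, since the first convolution's output must be written
    to memory before the second can consume it."""
    passes, current = [], []
    for layer in layers:
        if layer == "conv" and "conv" in current:
            passes.append(current)  # close the pass at the second conv
            current = []
        current.append(layer)
    if current:
        passes.append(current)
    return passes
```

For example, `split_into_passes(["conv", "conv", "activation"])` yields `[["conv"], ["conv", "activation"]]`: the two convolution layers fall in separate passes, while the activation layer shares a pass with the second convolution.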
Although the DNN accelerator 1400 of
Computing-based device 1500 comprises one or more processors 1502 which may be microprocessors, controllers or any other suitable type of processors for processing computer executable instructions to control the operation of the device in order to assess the performance of an integrated circuit defined by a hardware design in completing a task. In some examples, for example where a system on a chip architecture is used, the processors 1502 may include one or more fixed function blocks (also referred to as accelerators) which implement a part of the method of determining the number format for representing a set of values input to, or output from, a layer of a DNN in hardware (rather than software or firmware). Platform software comprising an operating system 1504 or any other suitable platform software may be provided at the computing-based device to enable application software, such as computer executable code 1505 for implementing one or more of the methods 800, 1300 of
The computer executable instructions may be provided using any computer-readable media that is accessible by computing-based device 1500. Computer-readable media may include, for example, computer storage media such as memory 1506 and communications media. Computer storage media (i.e. non-transitory machine readable media), such as memory 1506, includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EPROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device. In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transport mechanism. As defined herein, computer storage media does not include communication media. Although the computer storage media (i.e. non-transitory machine readable media, e.g. memory 1506) is shown within the computing-based device 1500 it will be appreciated that the storage may be distributed or located remotely and accessed via a network or other communication link (e.g. using communication interface 1508).
The computing-based device 1500 also comprises an input/output controller 1510 arranged to output display information to a display device 1512 which may be separate from or integral to the computing-based device 1500. The display information may provide a graphical user interface. The input/output controller 1510 is also arranged to receive and process input from one or more devices, such as a user input device 1514 (e.g. a mouse or a keyboard). In an embodiment the display device 1512 may also act as the user input device 1514 if it is a touch sensitive display device. The input/output controller 1510 may also output data to devices other than the display device, e.g. a locally connected printing device (not shown in
The DNN accelerator 1400 of
The hardware logic configurable to implement a DNN (e.g. the DNN accelerator 1400 of
The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for processors, including code expressed in a machine language, an interpreted language or a scripting language. Executable code includes binary code, machine code, bytecode, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in a programming language code such as C, Java or OpenCL. Executable code may be, for example, any kind of software, firmware, script, module or library which, when suitably executed, processed, interpreted, compiled, executed at a virtual machine or other software environment, cause a processor of the computer system at which the executable code is supported to perform the tasks specified by the code.
A processor, computer, or computer system may be any kind of device, machine or dedicated circuit, or collection or portion thereof, with processing capability such that it can execute instructions. A processor may be any kind of general purpose or dedicated processor, such as a CPU, GPU, System-on-chip, state machine, media processor, an application-specific integrated circuit (ASIC), a programmable logic array, a field-programmable gate array (FPGA), or the like. A computer or computer system may comprise one or more processors.
It is also intended to encompass software which defines a configuration of hardware as described herein, such as HDL (hardware description language) software, as is used for designing integrated circuits, or for configuring programmable chips, to carry out desired functions. That is, there may be provided a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition dataset that when processed (i.e. run) in an integrated circuit manufacturing system configures the system to manufacture hardware logic configurable to implement a DNN (e.g. DNN accelerator) described herein. An integrated circuit definition dataset may be, for example, an integrated circuit description.
Therefore, there may be provided a method of manufacturing, at an integrated circuit manufacturing system, hardware logic configurable to implement a DNN (e.g. DNN accelerator 1400 of
An integrated circuit definition dataset may be in the form of computer code, for example as a netlist, code for configuring a programmable chip, as a hardware description language defining hardware suitable for manufacture in an integrated circuit at any level, including as register transfer level (RTL) code, as high-level circuit representations such as Verilog or VHDL, and as low-level circuit representations such as OASIS (RTM) and GDSII. Higher level representations which logically define hardware suitable for manufacture in an integrated circuit (such as RTL) may be processed at a computer system configured for generating a manufacturing definition of an integrated circuit in the context of a software environment comprising definitions of circuit elements and rules for combining those elements in order to generate the manufacturing definition of an integrated circuit so defined by the representation. As is typically the case with software executing at a computer system so as to define a machine, one or more intermediate user steps (e.g. providing commands, variables etc.) may be required in order for a computer system configured for generating a manufacturing definition of an integrated circuit to execute code defining an integrated circuit so as to generate the manufacturing definition of that integrated circuit.
An example of processing an integrated circuit definition dataset at an integrated circuit manufacturing system so as to configure the system to manufacture hardware logic configurable to implement a DNN (e.g. DNN accelerator) will now be described with respect to
The layout processing system 1704 is configured to receive and process the IC definition dataset to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art, and for example may involve synthesising RTL code to determine a gate level representation of a circuit to be generated, e.g. in terms of logical components (e.g. NAND, NOR, AND, OR, MUX and FLIP-FLOP components). A circuit layout can be determined from the gate level representation of the circuit by determining positional information for the logical components. This may be done automatically or with user involvement in order to optimise the circuit layout. When the layout processing system 1704 has determined the circuit layout it may output a circuit layout definition to the IC generation system 1706. A circuit layout definition may be, for example, a circuit layout description.
The IC generation system 1706 generates an IC according to the circuit layout definition, as is known in the art. For example, the IC generation system 1706 may implement a semiconductor device fabrication process to generate the IC, which may involve a multiple-step sequence of photo lithographic and chemical processing steps during which electronic circuits are gradually created on a wafer made of semiconducting material. The circuit layout definition may be in the form of a mask which can be used in a lithographic process for generating an IC according to the circuit definition. Alternatively, the circuit layout definition provided to the IC generation system 1706 may be in the form of computer-readable code which the IC generation system 1706 can use to form a suitable mask for use in generating an IC.
The different processes performed by the IC manufacturing system 1702 may be implemented all in one location, e.g. by one party. Alternatively, the IC manufacturing system 1702 may be a distributed system such that some of the processes may be performed at different locations, and may be performed by different parties. For example, some of the stages of: (i) synthesising RTL code representing the IC definition dataset to form a gate level representation of a circuit to be generated, (ii) generating a circuit layout based on the gate level representation, (iii) forming a mask in accordance with the circuit layout, and (iv) fabricating an integrated circuit using the mask, may be performed in different locations and/or by different parties.
In other examples, processing of the integrated circuit definition dataset at an integrated circuit manufacturing system may configure the system to manufacture hardware logic configurable to implement a DNN (e.g. DNN accelerator) without the IC definition dataset being processed so as to determine a circuit layout. For instance, an integrated circuit definition dataset may define the configuration of a reconfigurable processor, such as an FPGA, and the processing of that dataset may configure an IC manufacturing system to generate a reconfigurable processor having that defined configuration (e.g. by loading configuration data to the FPGA).
In some embodiments, an integrated circuit manufacturing definition dataset, when processed in an integrated circuit manufacturing system, may cause an integrated circuit manufacturing system to generate a device as described herein. For example, the configuration of an integrated circuit manufacturing system in the manner described above with respect to
In some examples, an integrated circuit definition dataset could include software which runs on hardware defined at the dataset or in combination with hardware defined at the dataset. In the example shown in
The implementation of concepts set forth in this application in devices, apparatus, modules, and/or systems (as well as in methods implemented herein) may give rise to performance improvements when compared with known implementations. The performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During manufacture of such devices, apparatus, modules, and systems (e.g. in integrated circuits) performance improvements can be traded-off against the physical implementation, thereby improving the method of manufacture. For example, a performance improvement may be traded against layout area, thereby matching the performance of a known implementation but using less silicon. This may be done, for example, by reusing functional blocks in a serialised fashion or sharing functional blocks between elements of the devices, apparatus, modules and/or systems. Conversely, concepts set forth in this application that give rise to improvements in the physical implementation of the devices, apparatus, modules, and systems (such as reduced silicon area) may be traded for improved performance. This may be done, for example, by manufacturing multiple instances of a module within a predefined area budget.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2009432.2 | Jun 2020 | GB | national |