This application relates to the field of neural networks, and in particular, to a neural network accelerator, an acceleration method, and an apparatus.
Artificial intelligence (artificial intelligence, AI) is a theory, a method, a technology, or an application system that simulates, extends, and expands human intelligence by using a digital computer or a machine controlled by a digital computer, to perceive an environment, obtain knowledge, and achieve an optimal result based on the knowledge. In other words, artificial intelligence is a branch of computer science, and is intended to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have perception, inference, and decision-making functions. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and inference, human-computer interaction, recommendation and search, basic AI theory, and the like.
A neural network belongs to the connectionist school in the field of artificial intelligence, and is a mathematical model that uses a structure similar to brain nerve synaptic connections to process information. Calculations in the neural network mainly include a convolution operation, an activation operation, a pooling operation, and the like. The convolution operation occupies most of the neural network processing time. To obtain high performance and a high energy efficiency ratio in a limited area, researchers currently propose a winograd algorithm-based convolution operation mode. In this mode, specific matrix conversion is performed on an input feature map and a weight, so that an equivalent convolution operation task can be completed and multiplication operations in the convolution operation process can be greatly reduced.
However, for an accelerator that currently integrates the winograd algorithm to perform acceleration, core operation modules such as a matrix operation module and a vector operation module in the neural network usually need to be modified substantially, and the design is complex.
An embodiment of this application provides a neural network accelerator. The neural network accelerator is based on a winograd algorithm, and may apply the winograd algorithm to a neural network by using a conventional matrix operation module and vector operation module in the neural network. For a convolutional layer or pooling layer whose size is 3×3 (rows×columns) and whose stride is 1, a quantity of multiplication times can be greatly reduced, to improve performance and an energy efficiency ratio of an accelerator.
To achieve the foregoing objective, embodiments of this application provide the following technical solutions.
A first aspect of this application provides a neural network accelerator, including: a preprocessing module, configured to perform first forward winograd transform on a target matrix corresponding to an input feature map, to obtain a transformed target matrix, where performing the first forward winograd transform on the target matrix may be understood as left multiplying the target matrix by a matrix BT and right multiplying the target matrix by a matrix B, to obtain the transformed target matrix, and the preprocessing module is further configured to perform second forward winograd transform on a convolution kernel, to obtain a transformed convolution kernel, where performing the second forward winograd transform on the convolution kernel may be understood as left multiplying the convolution kernel by a matrix G and right multiplying the convolution kernel by a matrix GT, to obtain the transformed convolution kernel; a matrix operation module, configured to perform a matrix multiplication operation on a first matrix and a second matrix, to obtain a multiplication result, where the first matrix is constructed based on the transformed target matrix, and the second matrix is constructed based on the transformed convolution kernel; and a vector operation module, configured to perform inverse winograd transform on the multiplication result, to obtain an output feature map, where a process of performing the inverse winograd transform on the matrix multiplication result is equivalent to performing a vector addition and subtraction operation on the matrix multiplication result, and this may be implemented by using a conventional vector operation module.
It can be learned from the first aspect that a target matrix and a convolution kernel that are obtained after forward winograd transform is performed are used to construct the first matrix and the second matrix respectively, then a matrix multiplication operation is performed on the first matrix and the second matrix by using an existing matrix operation module in the neural network accelerator, and inverse winograd transform is performed on the multiplication result by using an existing vector operation module in the neural network accelerator. In this way, core operation modules such as the matrix operation module and the vector operation module in the neural network do not need to be modified, the design is simple, there is no need to add, to the neural network accelerator, a module dedicated to performing a point multiplication operation on the target matrix and the convolution kernel that are obtained after the forward winograd transform is performed, and efficiency of performing winograd calculation by the neural network accelerator is improved.
Optionally, with reference to the first aspect, in a first possible implementation, the preprocessing module is further configured to traverse the input feature map by using a sliding window, to obtain the target matrix corresponding to the input feature map. It can be learned from the first possible implementation of the first aspect that a specific manner of obtaining the target matrix is provided, the input feature map may be traversed by using the sliding window, and the target matrix is an input feature map of an area corresponding to the sliding window.
Optionally, with reference to the first possible implementation of the first aspect, in a second possible implementation, the input feature map is an input feature map on which a padding (padding) operation is performed, a size of the input feature map is W×H×k, W and H each are an even number not less than 4, k is an integer greater than 1, W is a row of the input feature map, H is a column of the input feature map, and k is a quantity of channels of the input feature map. The padding may be understood as adding some pixels to the periphery of the input feature map, for example, these pixels are initialized to 0 or another specified value. For an input feature map whose row and column each are not an even number not less than 4, pixels may be added to the periphery of the input feature map in a padding process, so that the row and the column of the input feature map each are an even number not less than 4. The input feature map is traversed by using a sliding window whose stride is 2 and whose size is 4×4, to obtain (((W−2)(H−2)/4)×k) target matrices, where the target matrices each are an input feature map of an area corresponding to the sliding window. It can be learned from the second possible implementation of the first aspect that a specific manner of determining the target matrix of the input feature map is provided. This increases diversity of the solution.
Optionally, with reference to the second possible implementation of the first aspect, in a third possible implementation, a size of the convolution kernel is 3×3×k×n, a stride of the convolution kernel is 1, n is a quantity of channels of the output feature map, and n is an integer greater than 1. According to the solution provided in this application, for a convolutional layer or pooling layer whose size is 3×3 (rows×columns) and whose stride is 1, a quantity of multiplication times can be greatly reduced, to improve performance and an energy efficiency ratio of an accelerator.
Optionally, with reference to the third possible implementation of the first aspect, in a fourth possible implementation, the first matrix includes an ith element in the transformed target matrix, i is a positive integer not greater than 16, the first matrix is a matrix with m rows and k columns, m is equal to ((W−2)(H−2)/4), the second matrix includes an ith element of the transformed convolution kernel, the second matrix is a matrix with k rows and n columns, and the multiplication result is used to determine the output feature map. It can be learned from the fourth possible implementation of the first aspect that a specific manner of constructing the first matrix and the second matrix is provided.
Optionally, with reference to the first aspect or the first to the fourth possible implementations of the first aspect, in a fifth possible implementation, the vector operation module is specifically configured to: perform the inverse winograd transform on the multiplication result, to obtain a third matrix; and reorder elements in the third matrix by using a preset reordering rule, to obtain the output feature map. It can be learned from the fifth possible implementation of the first aspect that, if the input feature map is processed at the convolutional layer, after the vector operation module processes multiplication results of 16 matrices, processed results are reordered, to obtain the output feature map.
Optionally, with reference to the first aspect or the first to the fourth possible implementations of the first aspect, in a sixth possible implementation, the vector operation module is specifically configured to: perform the inverse winograd transform on the multiplication result, to output a third matrix; and perform a summation operation on elements in the third matrix, to obtain the output feature map. It can be learned from the sixth possible implementation of the first aspect that, if the input feature map is processed at the pooling layer, a summation operation or a maximization operation may be performed on the elements in the third matrix, to obtain the output feature map.
Optionally, with reference to the first aspect or the first to the sixth possible implementations of the first aspect, in a seventh possible implementation, the second forward winograd transform includes third forward winograd transform and fourth forward winograd transform, and the neural network accelerator further includes a storage module. The storage module is configured to store a first transformation result of performing the third forward winograd transform on the convolution kernel by using the third matrix. A matrix transform unit is specifically configured to perform the fourth forward winograd transform on the first transformation result by using a fourth matrix, to obtain the transformed convolution kernel. The third matrix and the fourth matrix are matrices obtained after a transformation matrix of the second forward winograd transform is decomposed, a value of an element in the third matrix is 0 or ±1, and the fourth matrix is a matrix other than the third matrix in the matrices obtained after decomposition. It can be learned from the seventh possible implementation of the first aspect that, to reduce a calculation amount of the matrix transform unit in the accelerator, a part of a process of the second forward winograd transform may be performed offline.
Optionally, with reference to the first aspect or the first to the seventh possible implementations of the first aspect, in an eighth possible implementation, the matrix transform unit is further configured to: obtain M elements of a plurality of transformed target matrices, where M is an integer greater than 1; process the M elements according to a first preset formula, to output a plurality of first matrices; obtain N elements of a plurality of transformed convolution kernels, where N is an integer greater than 1; and process the N elements according to a second preset formula, to output a plurality of second matrices. It can be learned from the eighth possible implementation of the first aspect that, to further improve the performance of the accelerator, a plurality of elements in each transformed target matrix may be extracted at a time, and a plurality of first matrices may be output at a time. In addition, a plurality of elements in the transformed convolution kernel may be extracted at a time, and a plurality of second matrices may be output at a time.
Optionally, with reference to the first aspect or the first to the eighth possible implementations of the first aspect, in a ninth possible implementation, the vector operation module is further configured to dequantize the multiplication result. The vector operation module is specifically configured to perform the inverse winograd transform on a multiplication result obtained after dequantization. The vector operation module is further configured to quantize the output feature map, to obtain a quantized output feature map. It can be learned from the ninth possible implementation of the first aspect that, to meet a requirement of an operation of a fixed point number, a quantization operation and a dequantization operation may be added.
Optionally, with reference to the first aspect or the first to the ninth possible implementations of the first aspect, in a tenth possible implementation, the vector operation module is further configured to perform an offset operation on at least one multiplication result. It can be learned from the tenth possible implementation of the first aspect that, in the solution provided in this application, performing an offset operation on a multiplication result may be equivalent to performing an offset operation on the output feature map.
A second aspect of this application provides an acceleration method, including: performing first forward winograd transform on a target matrix corresponding to an input feature map, to obtain a transformed target matrix; performing second forward winograd transform on a convolution kernel, to obtain a transformed convolution kernel; performing a matrix multiplication operation on a first matrix and a second matrix, to obtain a multiplication result, where the first matrix is constructed based on the transformed target matrix, and the second matrix is constructed based on the transformed convolution kernel; and performing inverse winograd transform on the multiplication result, to obtain an output feature map.
Optionally, with reference to the second aspect, in a first possible implementation, the input feature map is traversed by using a sliding window, to obtain the target matrix corresponding to the input feature map.
Optionally, with reference to the first possible implementation of the second aspect, in a second possible implementation, the input feature map is an input feature map on which a padding (padding) operation is performed, a size of the input feature map is W×H×k, W and H each are an even number not less than 4, k is an integer greater than 1, W is a row of the input feature map, H is a column of the input feature map, and k is a quantity of channels of the input feature map. The input feature map is traversed by using a sliding window whose stride is 2 and whose size is 4×4, to obtain (((W−2)(H−2)/4)×k) target matrices.
Optionally, with reference to the second possible implementation of the second aspect, in a third possible implementation, the padding (padding) operation is performed on the input feature map, so that the size of the input feature map is W×H×k, where W and H each are an even number not less than 4, k is an integer greater than 1, W is the row of the input feature map, H is the column of the input feature map, and k is the quantity of channels of the input feature map. The input feature map is traversed by using the sliding window whose stride is 2 and whose size is 4×4, to obtain (((W−2)(H−2)/4)×k) target matrices.
Optionally, with reference to the third possible implementation of the second aspect, in a fourth possible implementation, a size of the convolution kernel is 3×3×k×n, a stride of the convolution kernel is 1, n is a quantity of channels of the output feature map, and n is an integer greater than 1.
Optionally, with reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, the first matrix includes an ith element in the transformed target matrix, i is a positive integer not greater than 16, the first matrix is a matrix with m rows and k columns, m is equal to ((W−2)(H−2)/4), the second matrix includes an ith element of the transformed convolution kernel, the second matrix is a matrix with k rows and n columns, and the multiplication result is used to determine the output feature map.
Optionally, with reference to the second aspect or the first to the fifth possible implementations of the second aspect, in a sixth possible implementation, the performing inverse winograd transform on the multiplication result, to obtain an output feature map includes: performing the inverse winograd transform on the multiplication result, to obtain a third matrix; and reordering elements in the third matrix by using a preset reordering rule, to obtain the output feature map.
Optionally, with reference to the second aspect or the first to the sixth possible implementations of the second aspect, in a seventh possible implementation, the performing inverse winograd transform on the multiplication result, to obtain an output feature map includes: performing the inverse winograd transform on the multiplication result, to output a third matrix; and performing a summation operation on elements in the third matrix, to obtain the output feature map.
Optionally, with reference to the second aspect or the first to the seventh possible implementations of the second aspect, in an eighth possible implementation, the second forward winograd transform includes third forward winograd transform and fourth forward winograd transform, and the performing second forward winograd transform on a convolution kernel whose size is 3×3×k×n and whose stride is 1, to obtain a transformed convolution kernel includes: performing the third forward winograd transform on the convolution kernel by using the third matrix, to obtain a first transformation result; and performing the fourth forward winograd transform on the first transformation result by using a fourth matrix, to obtain the transformed convolution kernel, where the third matrix and the fourth matrix are matrices obtained after a transformation matrix of the second forward winograd transform is decomposed, a value of an element in the third matrix is 0 or ±1, and the fourth matrix is a matrix other than the third matrix in the matrices obtained after decomposition.
Optionally, with reference to the second aspect or the first to the eighth possible implementations of the second aspect, in a ninth possible implementation, the method further includes: obtaining M elements of a plurality of transformed target matrices, where M is an integer greater than 1; processing the M elements according to a first preset formula, to output a plurality of first matrices; obtaining N elements of a plurality of transformed convolution kernels, where N is an integer greater than 1; and processing the N elements according to a second preset formula, to output a plurality of second matrices.
Optionally, with reference to the second aspect or the first to the ninth possible implementations of the second aspect, in a tenth possible implementation, the method further includes: dequantizing the multiplication result, to obtain a dequantized multiplication result. The performing inverse winograd transform on the multiplication result, to obtain an output feature map includes: performing the inverse winograd transform on the dequantized multiplication result, to obtain the output feature map. The method further includes: quantizing the output feature map, to obtain a quantized output feature map.
Optionally, with reference to the second aspect or the first to the ninth possible implementations of the second aspect, in an eleventh possible implementation, the method further includes: performing an offset operation on the multiplication result.
A third aspect of this application provides a neural network apparatus. The neural network apparatus includes a neural network accelerator. The neural network accelerator is the neural network accelerator described in any one of the first aspect or the possible implementations of the first aspect.
A fourth aspect of this application provides a chip system. The chip system includes a processor and a communication interface. The processor obtains program instructions through the communication interface, and when the program instructions are executed by the processor, the method described in any one of the second aspect or the possible implementations of the second aspect is implemented.
A fifth aspect of this application provides a chip system. The chip system includes a processor and a memory, the memory stores a program, and when the program instructions stored in the memory are executed by the processor, the method described in any one of the second aspect or the possible implementations of the second aspect is implemented.
A sixth aspect of this application provides a computer-readable storage medium, including a program. When the program is executed by a processing unit, the method described in any one of the second aspect or the possible implementations of the second aspect is performed.
A seventh aspect of this application provides a computer program product. When the computer program product runs on a computer, the computer is enabled to perform the method described in any one of the second aspect or the possible implementations of the second aspect.
To describe the technical solutions in embodiments of the present application or in the conventional technology more clearly, the following briefly describes the accompanying drawings for describing embodiments or the conventional technology. It is clear that the accompanying drawings in the following description show merely some embodiments of the present application, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
The following describes embodiments of this application with reference to the accompanying drawings. It is clear that the described embodiments are merely some but not all of embodiments of this application. A person of ordinary skill in the art may learn that, with technology development and emergence of a new scenario, the technical solutions provided in embodiments of this application are also applicable to a similar technical problem.
In the specification, claims, and the accompanying drawings of this application, the terms “first”, “second”, and the like are intended to distinguish between similar objects but do not necessarily indicate a specific order or sequence. It should be understood that data termed in such a way are interchangeable in proper circumstances, so that the embodiments described herein can be implemented in other orders than the order illustrated or described herein. Moreover, the terms “including”, “having”, and any other variants thereof are intended to cover a non-exclusive inclusion, for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those steps or modules that are clearly listed, but may include other steps or modules that are not clearly listed or that are inherent to such a process, method, product, or device. Names or numbers of steps in this application do not mean that the steps in a method procedure need to be performed in a time/logical sequence indicated by the names or numbers. An execution sequence of the steps in the procedure that have been named or numbered can be changed based on a technical objective to be achieved, provided that same or similar technical effects can be achieved. Division into the modules in this application is logical division. During actual application, there may be another division manner. For example, a plurality of modules may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be through some ports, and the indirect coupling or communication connection between modules may be in an electrical form or another similar form. This is not limited in this application. In addition, modules or sub-modules described as separate components may be or may not be physically separated, or may be or may not be physical modules, or may not be grouped into a plurality of circuit modules. Objectives of the solutions of this application may be achieved by selecting some or all of the modules according to actual requirements.
Embodiments of this application relate to application of a large quantity of neural networks. Therefore, for ease of understanding, the following first describes related concepts of the neural network.
A neural network may include a neuron. The neuron may be an operation unit that uses x_s and an intercept of 1 as an input. An output of the operation unit may be as follows:

h_{W,b}(x) = f(W^T x) = f(\sum_{s=1}^{n} W_s x_s + b)

where s = 1, 2, . . . , n, n is a natural number greater than 1, W_s is a weight of x_s, and b is a bias of the neuron. f indicates an activation function (activation function) of the neuron, and the activation function is used for introducing a non-linear characteristic into the neural network, to convert an input signal in the neuron into an output signal. The output signal of the activation function may be used as an input of a next convolutional layer, and the activation function may be a sigmoid function. The neural network is a network constituted by connecting a plurality of single neurons together. To be specific, an output of a neuron may be an input to another neuron. An input of each neuron may be connected to a local receptive field of a previous layer to extract a feature of the local receptive field. The local receptive field may be an area including several neurons.
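As a minimal illustration (the input values, weights, and bias below are arbitrary, and the sigmoid function is used as the activation function f), the neuron output above may be computed as follows:

```python
import numpy as np

def neuron_output(x, w, b):
    """Single neuron: h_{W,b}(x) = f(sum_s W_s * x_s + b), with f chosen as sigmoid."""
    z = np.dot(w, x) + b              # weighted sum of the inputs plus the bias
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation introduces the non-linearity

# Hypothetical input, weights, and bias chosen only for illustration
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron_output(x, w, 0.2))
```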
A convolutional neural network (convolutional neural network, CNN) is a deep neural network with a convolutional architecture. The convolutional neural network includes a feature extractor including a convolutional layer and a sub-sampling layer, and the feature extractor may be considered as a filter. The convolutional layer is a neuron layer that is in the convolutional neural network and at which convolution processing is performed on an input signal. At the convolutional layer of the convolutional neural network, one neuron may be connected only to some adjacent-layer neurons. One convolutional layer usually includes several feature planes, and each feature plane may include some neurons arranged in a rectangular form. Neurons on a same feature plane share a weight, where the shared weight is a convolution kernel. Weight sharing may be understood as that an image information extraction manner is irrelevant to a location. The convolution kernel may be initialized in a form of a random-size matrix. In a process of training the convolutional neural network, the convolution kernel may obtain an appropriate weight through learning. In addition, a direct benefit brought by weight sharing is that connections between layers of the convolutional neural network are reduced and an overfitting risk is lowered.
Because the CNN is a common neural network, the following focuses on a structure of the CNN in detail with reference to
A structure of a neural network in embodiments of this application may be shown in
Convolutional Layer/Pooling Layer 220:
Convolutional Layer:
As shown in
The following uses the convolutional layer 221 as an example to describe an internal working principle of one convolutional layer.
The convolutional layer 221 may include a plurality of convolution operators. The convolution operator is also referred to as a kernel or a convolution kernel. During image processing, the convolution operator functions as a filter that extracts specific information from an input image matrix. The convolution operator may essentially be a weight matrix, and the weight matrix is usually predefined. In a process of performing a convolution operation on an image, the weight matrix usually processes pixels at a granularity level of one pixel (or two pixels or the like, depending on a value of a stride (stride)) in a horizontal direction on an input image, to extract a specific feature from the image. A size of the weight matrix should be related to a size of the image. It should be noted that a depth dimension (depth dimension) of the weight matrix is the same as a depth dimension of the input image. During a convolution operation, the weight matrix extends to an entire depth of the input image. Therefore, a convolutional output of a single depth dimension is generated through convolution with a single weight matrix. However, in most cases, a single weight matrix is not used, but a plurality of weight matrices with a same size (rows×columns), namely, a plurality of same-type matrices, are applied. Outputs of the weight matrices are stacked to form a depth dimension of a convolutional image. The dimension herein may be understood as being determined based on the foregoing “plurality”. Different weight matrices may be used to extract different features from the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and still another weight matrix is used to blur unnecessary noise in the image. The plurality of weight matrices have the same size (rows×columns), and convolutional feature maps extracted by the plurality of weight matrices with the same size have a same size. Then, the plurality of extracted convolutional feature maps with the same size are combined to form an output of the convolution operation.
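The following is a minimal numpy sketch, with hypothetical sizes, of the behavior described above: each of n same-size weight matrices is slid over the k-channel input, and the n single-depth outputs are stacked to form the depth dimension of the convolutional output.

```python
import numpy as np

def conv_layer(feature_map, kernels, stride=1):
    """feature_map: (H, W, k); kernels: (n, 3, 3, k). Returns an (out_h, out_w, n) output."""
    H, W, k = feature_map.shape
    n = kernels.shape[0]
    out_h = (H - 3) // stride + 1
    out_w = (W - 3) // stride + 1
    out = np.zeros((out_h, out_w, n))
    for o in range(n):                    # each weight matrix produces one output channel
        for i in range(out_h):
            for j in range(out_w):
                patch = feature_map[i*stride:i*stride+3, j*stride:j*stride+3, :]
                out[i, j, o] = np.sum(patch * kernels[o])
    return out                            # n single-depth outputs stacked along the depth dimension

x = np.random.randn(8, 8, 4)              # hypothetical 8x8 input with 4 channels
w = np.random.randn(16, 3, 3, 4)          # 16 weight matrices of size 3x3x4
print(conv_layer(x, w).shape)             # (6, 6, 16)
```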
Weight values in these weight matrices need to be obtained through a lot of training during actual application. Each weight matrix formed by using the weight values obtained through training may be used to extract information from an input image, to enable the convolutional neural network 200 to perform correct prediction.
When the convolutional neural network 200 has a plurality of convolutional layers, a large quantity of general features are usually extracted at an initial convolutional layer (for example, 221). The general feature may also be referred to as a low-level feature. As the depth of the convolutional neural network 200 increases, a feature extracted at a subsequent convolutional layer (for example, 226) becomes more complex, for example, a high-level semantic feature. A feature with higher semantics is more applicable to a to-be-resolved problem.
Pooling Layer:
Because a quantity of training parameters usually needs to be reduced, a pooling layer usually needs to be periodically introduced after a convolutional layer. To be specific, for the layers 221 to 226 in the layer 220 shown in
Neural Network Layer 230:
After processing performed at the convolutional layer/pooling layer 220, the convolutional neural network 200 is not ready to output required output information. As described above, at the convolutional layer/pooling layer 220, only a feature is extracted, and parameters brought by an input image are reduced. However, to generate final output information (required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate an output of a quantity of one or a group of classes. Therefore, the neural network layer 230 may include a plurality of hidden layers (such as 231 and 232 to 23n shown in
After the plurality of hidden layers in the neural network layer 230, that is, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to cross entropy for classification, and is specifically configured to calculate a prediction error. Once forward propagation (for example, propagation in a direction from 210 to 240 in
The structure of the neural network in embodiments of this application may be shown in
It should be noted that the convolutional neural network shown in
Calculations in the neural network mainly include a convolution operation, an activation operation, a pooling operation, and the like. The convolution operation occupies most of the neural network processing time. In addition, convolutional layers whose convolution kernel size is 3×3 (rows×columns) and whose stride is 1 occupy a large proportion of convolution calculation. Therefore, acceleration of this type of convolutional layer is of great value. By using a winograd algorithm, a quantity of multiplication times of the convolutional layer whose size is 3×3 and whose stride is 1 may be greatly reduced. This is beneficial to hardware performance improvement and energy efficiency ratio improvement. To better understand the solution, the following describes the winograd algorithm.
For the winograd algorithm, an input signal D may be considered as a 4×4 matrix, as shown in the following formula 1-1, and a convolution kernel K is considered as a 3×3 matrix, as shown in the following formula 1-2.
According to the winograd algorithm, a matrix multiplication form of convolution of D and K may be represented by the following formula 1-3. Because it is the conventional technology to transform a convolution operation according to the winograd algorithm, derivation is not performed in this application, and only a result obtained after derivation is listed. The formula 1-3 represents that a matrix D of the input signal is left multiplied by a matrix BT and right multiplied by a matrix B, to obtain a transformed matrix U. This process is a process of performing forward winograd transform on the input signal. A size of the matrix U is 4×4. A matrix K corresponding to the convolution kernel is left multiplied by a matrix G and right multiplied by a matrix GT, to obtain a transformed matrix V. A size of the matrix V is 4×4. This process is a process of performing forward winograd transform on the convolution kernel. A point multiplication operation is performed on the matrix U and the matrix V, to obtain a matrix U×V, and then the matrix U×V is left multiplied by a matrix AT and right multiplied by a matrix A, to obtain a matrix corresponding to a final output signal. This process is a process of inverse winograd transform.
BT is represented by using a formula 1-4, B is represented by using a formula 1-5, G is represented by using a formula 1-6, GT is represented by using a formula 1-7, AT is represented by using a formula 1-8, and A is represented by using a formula 1-9. The output signal is a 2×2 matrix and is represented by using a formula 2-0 in this application.
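The formula bodies are not reproduced in this text. As an illustrative cross-check, the following numpy sketch uses the commonly published Winograd F(2×2, 3×3) transformation matrices in the roles of BT, G, and AT described above (these specific values are taken from the public winograd literature, not quoted from the formulas of this application), and verifies that the pipeline of formula 1-3 matches a direct 3×3 convolution with a stride of 1:

```python
import numpy as np

# Commonly published Winograd F(2x2, 3x3) transformation matrices
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

D = np.random.randn(4, 4)          # input signal D (a 4x4 matrix)
K = np.random.randn(3, 3)          # convolution kernel K (a 3x3 matrix)

U = B_T @ D @ B_T.T                # forward winograd transform of the input signal
V = G @ K @ G.T                    # forward winograd transform of the convolution kernel
Y = A_T @ (U * V) @ A_T.T          # point multiplication (16 products) + inverse transform

# Direct 3x3 convolution with stride 1 over the 4x4 input yields the same 2x2 output
ref = np.array([[np.sum(D[i:i+3, j:j+3] * K) for j in range(2)] for i in range(2)])
print(np.allclose(Y, ref))         # True
```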
After winograd transform, a quantity of multiplication times can be reduced from 36 to 16. If the winograd algorithm is extended to a neural network with a 3×3 convolution kernel, an energy efficiency ratio can be improved.
Currently, most matrix operation-based CNN accelerators are not integrated with 2D winograd for acceleration, and have bottlenecks in energy efficiency ratio and computing power. For an accelerator that integrates the winograd algorithm for acceleration, a core calculating unit usually needs to be modified a lot. For example, a matrix operation module and a vector operation module in a neural network need to be modified a lot, or dedicated hardware support is required, for example, a dedicated hardware module for performing a point multiplication operation needs to be added. Currently, a solution of applying the winograd algorithm to an accelerator of a neural network is not ideal.
In this application, a disadvantage of an existing method is comprehensively considered, and a conventional matrix operation module (matrix unit) and a conventional vector operation module (vector unit) are used to apply the winograd algorithm to an accelerator of a neural network. There is no need to modify the core calculating unit a lot and no dedicated hardware support is required.
To better understand this application, the following specifically describes the research idea of the technical solution described in this application.
It can be learned from the foregoing description of the winograd algorithm that, in the winograd algorithm, the input signal D is a 4×4 matrix, but during actual application, an input feature map may be of any size. To resolve this problem, the input feature map may be traversed by using a sliding window whose size is 4×4. In this case, an area corresponding to each sliding window is a 4×4 matrix. In this application, the area corresponding to the sliding window is referred to as a target matrix. In addition, in the winograd algorithm, a stride of a convolution kernel convolved with an input signal whose size is 4×4 is 1, to obtain an output signal. The output signal is a 2×2 matrix. In this case, in this solution, to output an output feature map corresponding to the input feature map, a stride of the sliding window whose size is 4×4 is set to 2. After it is determined that the stride of the sliding window is 2, a row and a column of the input feature map each should be an even number, to obtain an integer quantity of sliding windows. If the row and the column of the input feature map each are not an even number, a padding (padding) operation may be first performed on the input feature map, so that the row and the column of the input feature map each are an even number. In the winograd algorithm, the input signal D is a 4×4 matrix. Therefore, in this application, to use the winograd algorithm, the row and column of the input feature map each should be an even number not less than 4.
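The following is a minimal numpy sketch of this preparation step (the helper names and the zero-value padding policy are illustrative assumptions, not the exact behavior of the accelerator): the input feature map is padded so that its row and column each are an even number not less than 4, and a 4×4 sliding window with a stride of 2 extracts the target matrices.

```python
import numpy as np

def pad_to_valid(fmap):
    """Pad the H and W of an (H, W, k) feature map so that each is an even number >= 4."""
    H, W, k = fmap.shape
    H2, W2 = max(4, H + (H % 2)), max(4, W + (W % 2))
    out = np.zeros((H2, W2, k), dtype=fmap.dtype)
    out[:H, :W, :] = fmap          # added border pixels stay 0 (the padding value)
    return out

def extract_target_matrices(fmap):
    """Traverse with a 4x4 window and a stride of 2: (H-2)(W-2)/4 windows in total."""
    H, W, k = fmap.shape
    tiles = []
    for i in range(0, H - 2, 2):
        for j in range(0, W - 2, 2):
            tiles.append(fmap[i:i+4, j:j+4, :])   # one 4x4xk target matrix per window
    return np.stack(tiles)                         # shape: (num_windows, 4, 4, k)

x = pad_to_valid(np.random.randn(7, 9, 3))         # hypothetical 7x9 input with 3 channels
tiles = extract_target_matrices(x)
print(x.shape, tiles.shape)                        # (8, 10, 3) (12, 4, 4, 3)
```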
In the solution provided in this application, a matrix transform unit may be added. The matrix transform unit may perform forward winograd transform on each target matrix, to obtain a transformed target matrix. A process of performing forward winograd transform on a target matrix may be understood with reference to a process of performing forward transform on the input signal in the winograd algorithm, to be specific, the target matrix is left multiplied by a matrix BT and right multiplied by a matrix B, to obtain a transformed target matrix. Forward winograd transform may be performed on each convolution kernel by using the matrix transform unit, to obtain a transformed convolution kernel. A process of performing forward winograd transform on a convolution kernel may be understood with reference to a process of performing forward transform on a convolution kernel in the winograd algorithm, to be specific, the convolution kernel is left multiplied by a matrix G and right multiplied by a matrix GT, to obtain a transformed convolution kernel.
In addition, in a convolutional neural network, an input feature map includes a plurality of image channels, that is, compared with the input signal in the winograd algorithm, one dimension is added to the input feature map, and the added dimension is a quantity of input channels. In the convolutional neural network, a convolution kernel includes the dimension of the quantity of input channels, and the convolution kernel further includes a dimension of a quantity of output channels (namely, a quantity of convolution kernels). In other words, compared with the convolution kernel in the winograd algorithm, two dimensions are further added to the convolution kernel in the convolutional neural network: the quantity of input channels and the quantity of output channels. In the winograd algorithm, a point multiplication operation needs to be performed on a matrix U and a matrix V. In the convolutional neural network, the dimension of the quantity of input channels is added to the input feature map, and the dimension of the quantity of input channels and the dimension of the quantity of output channels are added to the convolution kernel. Therefore, the winograd algorithm cannot be directly applied to the convolutional neural network. In the conventional technology, a core calculating unit of the convolutional neural network usually needs to be modified a lot, or dedicated hardware support is needed. However, in the solution provided in this application, a point multiplication operation process is converted into a matrix multiplication operation based on obtaining the transformed target matrix and the transformed convolution kernel. According to the solution provided in this application, the winograd algorithm can be applied to the convolutional neural network only by adding a matrix transform unit and then using a conventional matrix operation module and a conventional vector operation module in the convolutional neural network. For how to convert the point multiplication operation into the matrix multiplication operation, in this application, a first matrix and a second matrix are constructed, to convert the point multiplication operation into the multiplication of the first matrix and the second matrix. The first matrix includes an ith element in each transformed target matrix, i is a positive integer not greater than 16, the first matrix is a matrix with m rows and k columns, and m is equal to ((W−2)(H−2)/4). The second matrix includes an ith element in each transformed convolution kernel, and the second matrix is a matrix with k rows and n columns. A multiplication result is used to determine an output feature map. Through the foregoing process, 16 first matrices and 16 second matrices may be obtained, and the 16 first matrices are multiplied by the 16 second matrices in a one-to-one correspondence, to obtain 16 multiplication results. For example, when i is 1, the first matrix includes the first element in each transformed target matrix, the second matrix includes the first element in each transformed convolution kernel, and the first matrix is multiplied by the second matrix, to obtain a first multiplication result. When i is 2, the first matrix includes the second element in each transformed target matrix, the second matrix includes the second element in each transformed convolution kernel, and the first matrix is multiplied by the second matrix, to obtain a second multiplication result. By analogy, when i is 16, a sixteenth multiplication result may be obtained.
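A minimal numpy sketch of this construction (the array layouts and names are illustrative assumptions): for each of the 16 element positions, the ith elements of all transformed target matrices form an m×k first matrix, the ith elements of all transformed convolution kernels form a k×n second matrix, and one matrix multiplication replaces the point multiplication plus the accumulation over input channels.

```python
import numpy as np

def winograd_matmuls(U_tiles, V_kern):
    """U_tiles: (m, 4, 4, k) transformed target matrices; V_kern: (4, 4, k, n) transformed kernels."""
    results = []
    for i in range(16):                    # one matrix multiplication per element position
        r, c = divmod(i, 4)
        first = U_tiles[:, r, c, :]        # i-th element of every transformed target matrix: (m, k)
        second = V_kern[r, c, :, :]        # i-th element of every transformed kernel: (k, n)
        results.append(first @ second)     # multiplication result S_i of shape (m, n)
    return results

m, k, n = 12, 3, 16                        # hypothetical sizes
S = winograd_matmuls(np.random.randn(m, 4, 4, k), np.random.randn(4, 4, k, n))
print(len(S), S[0].shape)                  # 16 (12, 16)
```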
In this application, the multiplication result is sometimes referred to as a matrix multiplication result, and the multiplication result and the matrix multiplication result have a same meaning. Then, the vector operation module performs inverse winograd transform on the matrix multiplication result, and a process of performing inverse winograd transform on the matrix multiplication result is to left multiply the matrix multiplication result by a matrix AT and right multiply the matrix multiplication result by a matrix A. In this solution, a manner of constructing the first matrix and the second matrix is used to convert a result of the point multiplication operation into 16 matrix multiplication results. Therefore, a process of performing inverse winograd transform on the matrix multiplication results is equivalent to performing a vector addition and subtraction operation on the 16 matrix multiplication results, and this may be implemented by using a conventional vector operation module. A specific process is described in detail below. After the vector operation module processes the 16 matrix multiplication results, the processed results are reordered, or a sum or a maximum value of the processed results is calculated, to obtain the output feature map corresponding to the input feature map.
In addition, on the basis of the foregoing research idea, to reduce an area of an accelerator, in the solution provided in this application, a forward transform process of a convolution kernel is divided into two parts, one part of the process is performed offline, and the other part of the process is performed on a chip; or a forward transform result of the convolution kernel is obtained through offline calculation. In addition, data formats of the input feature map and the convolution kernel may be fixed point numbers. To meet a requirement of a convolution operation of the fixed point numbers, the solution provided in this application may support de-quantization and quantization processing, and a de-quantization process may be performed before an inverse transform operation, so that a bit width can be reduced and computing power is greater. In addition, in the solution provided in this application, performing an offset operation on a multiplication result may be equivalent to performing an offset operation on the output feature map. In addition, to improve operation efficiency of the accelerator, the matrix transform unit, the matrix operation module, and the vector operation module in the solution provided in this application may operate in parallel in a pipelined manner. Some calculations in the solution provided in this application may be on-the-fly calculations. For example, some inverse winograd transforms may be completed through on-the-fly calculation (on-the-fly calculation) in a process of transferring from the matrix operation module to the vector operation module.
Based on the foregoing research idea, the following specifically describes the technical solutions provided in this application.
Compared with an existing neural network accelerator in the conventional technology, the neural network accelerator provided in this application only needs to add a preprocessing module, to apply a winograd algorithm to a neural network.
The preprocessing module 301 is configured to perform first forward winograd transform on a target matrix corresponding to an input feature map, to obtain a transformed target matrix.
The preprocessing module 301 is further configured to perform second forward winograd transform on a convolution kernel, to obtain a transformed convolution kernel.
The matrix operation module 302 is configured to perform a matrix multiplication operation on a first matrix and a second matrix, to obtain a multiplication result. The first matrix is constructed based on the transformed target matrix, and the second matrix is constructed based on the transformed convolution kernel. In some implementations, the matrix operation module 302 includes a plurality of processing units (process engine, PE). In some implementations, the matrix operation module 302 is a two-dimensional systolic array. Alternatively, the matrix operation module 302 may be a one-dimensional systolic array or another electronic circuit that can perform mathematical operations such as multiplication and addition. In some implementations, the matrix operation module 302 is a general-purpose matrix processor. For example, it is assumed that there are an input matrix A, a weight matrix B, and an output matrix C. The matrix operation module fetches data corresponding to the matrix B from a memory, and buffers the data on each PE in the matrix operation module. The matrix operation module obtains data of the matrix A from the memory, and performs a matrix operation on the data of the matrix A and the data of the matrix B.
The vector operation module 303 is configured to perform inverse winograd transform on the multiplication result, to obtain an output feature map. The vector operation module includes a plurality of operation processing units, to perform further processing on an output of the matrix operation module if necessary, for example, vector multiplication, vector addition, an exponential operation, a logarithm operation, and size comparison. The vector operation module is mainly configured to perform network calculation at a non-convolutional/fully connected layer in the neural network, for example, batch normalization (batch normalization), pixel-level summation, and upsampling on a feature plane.
In some possible implementations, with reference to
The obtaining unit 3011 is configured to obtain an input feature map on which padding (padding) is performed. A size of the input feature map is W×H×k, W and H each are an even number not less than 4, k is a positive integer, W is a row of the input feature map, H is a column of the input feature map, and k is a quantity of channels of the input feature map. In this application, the quantity of channels of the input feature map is sometimes referred to as a quantity of input channels for short, and the quantity of channels of the input feature map and the quantity of input channels have a same meaning.
The padding may be understood as adding some pixels to the periphery of the input feature map, and initializing these pixels to 0 or another specified value. For an input feature map whose row and column each are not an even number not less than 4, pixels may be added to the periphery of the input feature map in a padding process, so that the row and the column of the input feature map each are an even number not less than 4.
It should be noted that calculation manners of padding in a related technology may be used in this embodiment of this application.
The traversal unit 3012 is configured to traverse the input feature map by using a sliding window whose stride is 2 and whose size is 4×4, to obtain (((W−2)(H−2)/4)×k) target matrices, where the target matrices each are an input feature map of an area corresponding to the sliding window.
Each 4×4 area corresponding to the sliding window may be considered as a target matrix. If it is considered that each target matrix further includes the dimension of the quantity of input channels, (((W−2)(H−2)/4)×k) target matrices may be obtained after the traversal unit 3012 traverses the input feature map.
The matrix transform unit 3013 is configured to perform the first forward winograd transform on a target matrix, to obtain a transformed target matrix. That is, the target matrix is left multiplied by a matrix BT and right multiplied by a matrix B, to obtain the transformed target matrix.
The matrix transform unit 3013 is further configured to perform the second forward winograd transform on a convolution kernel whose size is 3×3×k×n and whose stride is 1, to obtain a transformed convolution kernel, where n is a quantity of channels of the output feature map. That is, the convolution kernel is left multiplied by a matrix G and right multiplied by a matrix GT, to obtain the transformed convolution kernel.
The matrix operation module 302 is configured to determine the multiplication result of the first matrix and the second matrix. The first matrix includes an ith element in each transformed target matrix, i is a positive integer not greater than 16, the first matrix is a matrix with m rows and k columns, and m is equal to ((W−2)(H−2)/4). The second matrix includes an ith element in each transformed convolution kernel, and the second matrix is a matrix with k rows and n columns. The multiplication result is used to determine the output feature map.
In the winograd algorithm, a point multiplication operation should be performed on the transformed convolution kernel and the transformed target matrix. In this application, the point multiplication operation performed on the transformed convolution kernel and the transformed target matrix is converted into a multiplication operation between two matrices, so that the winograd algorithm can be applied to a convolutional neural network by using only the conventional matrix operation module 302 in such a design. The following describes an idea of how to construct the first matrix and the second matrix.
An ith element in each transformed target matrix is extracted, to form a matrix with m rows and k columns, where the matrix is the first matrix. In the description of
An ith element in each transformed convolution kernel is extracted, to form a matrix with k rows and n columns, and the matrix is the second matrix. As shown in
In the foregoing manner, the point multiplication operation between the transformed target matrix and the transformed convolution kernel may be converted into multiplication of the first matrix and the second matrix. With reference to
Element wise refers to performing an operation on corresponding elements in at least two matrices, for example, performing an operation on an ith element in one matrix and an ith element in another matrix, where the operation may include an addition operation, a subtraction operation, or the like.
Specifically, addition or subtraction is performed on the 16 multiplication results, and Q1=P1+P2+P3, Q2=P2−P3−P4, Q3=P5+P6+P7, and Q4=P6−P7−P8 may be determined by using an inverse winograd transform formula, where P1=S0+S4+S8, P2=S1+S5+S9, P3=S2+S6+S10, P4=S3+S7+S11, P5=S4−S8−S12, P6=S5−S9−S13, P7=S6−S10−S14, and P8=S7−S11−S15.
Q1, Q2, Q3, and Q4 may be used to determine the output feature map corresponding to the input feature map.
It can be learned that performing inverse winograd transform on the 16 multiplication results may be converted into performing an addition or subtraction operation on multiplication results of 16 matrices by using the conventional vector operation module 303, to output a third matrix, where the third matrix may include Q1, Q2, Q3, and Q4. The third matrix may be processed to obtain the output feature map.
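The addition and subtraction expressions above translate directly into element-wise vector operations over the 16 multiplication results; the following is a minimal numpy sketch (S stands for a list of the 16 multiplication results S0 to S15, each an m×n matrix; the naming is illustrative):

```python
import numpy as np

def inverse_transform(S):
    """S: list of 16 multiplication results S0..S15, each of shape (m, n)."""
    P1 = S[0] + S[4] + S[8]
    P2 = S[1] + S[5] + S[9]
    P3 = S[2] + S[6] + S[10]
    P4 = S[3] + S[7] + S[11]
    P5 = S[4] - S[8] - S[12]
    P6 = S[5] - S[9] - S[13]
    P7 = S[6] - S[10] - S[14]
    P8 = S[7] - S[11] - S[15]
    Q1 = P1 + P2 + P3          # element-wise vector additions/subtractions only, no multiplications
    Q2 = P2 - P3 - P4
    Q3 = P5 + P6 + P7
    Q4 = P6 - P7 - P8
    return Q1, Q2, Q3, Q4      # the third matrix
```

For a pooling layer, Q1 to Q4 are then averaged or maximized as described below; for a convolutional layer, they are reordered to form the output feature map.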
In a possible implementation, if the input feature map is processed at a pooling layer, because common operations at the pooling layer usually include maximum value pooling and average value pooling, a maximum value or a sum of the four matrices Q1, Q2, Q3, and Q4 included in the third matrix may be obtained. (Q1+Q2+Q3+Q4)/4 is output during average value pooling, and MAX(Q1, Q2, Q3, Q4) is output during maximum value pooling. Data output according to the solution provided in this application, for example, (Q1+Q2+Q3+Q4)/4 and MAX(Q1, Q2, Q3, Q4), may be used as an expression form of the output feature map.
In a possible implementation, if the input feature map is processed at a convolutional layer, the elements in the third matrix further need to be reordered according to a preset reordering rule, to obtain the output feature map. With reference to
The following describes a principle of reordering the elements in the third matrix according to the preset reordering rule, to obtain the output feature map. A first element in each transformed target matrix is extracted to form a first matrix with m rows and k columns, a first element in each transformed convolution kernel is extracted to form a second matrix with K rows and n columns, and when i is 1, a multiplication result of the first matrix and the second matrix is S1; a second element in each transformed target matrix is extracted to form a first matrix with m rows and k columns, a second element in each transformed convolution kernel is extracted to form a second matrix with K rows and n columns, and when i is 2, a multiplication result of the first matrix and the second matrix is S2; and so on. If the first element in each of matrices S1 to S16 is extracted to form a matrix, for example, form a matrix 1, a 2×2 matrix may be output after inverse winograd transform is performed on the matrix 1, and each element in the 2×2 matrix includes a quantity of a plurality of output channels, that is, each element has the dimension of the quantity of output channels. The 2×2 matrix 1 is an output feature map corresponding to an input feature map of an area in which a first sliding window is located. For another example, if the second element in each of the matrices in S1 to S16 is extracted to form a matrix, for example, form a matrix 2, a 2×2 matrix may be output after inverse winograd transform is performed on the matrix 2, and each element in the 2×2 matrix includes a quantity of a plurality of output channels. The 2×2 matrix 2 is an output feature map corresponding to an input feature map of an area in which a second sliding window is located, and the second sliding window means that a sliding window whose stride is 2 slides once. An operation procedure for obtaining an ith element in the 2×2 matrix corresponding to the matrix 1 is the same as an operation procedure for obtaining an ith element in the 2×2 matrix corresponding to the matrix 2, and so on. An operation procedure for obtaining an ith element in a 2×2 matrix corresponding to a matrix i is the same. The matrix i is a matrix formed by all the ith elements extracted from the matrices S1 to S16. Therefore, inverse winograd transform is performed on the 16 multiplication results to output Q1, Q2, Q3, and Q4. Q1 includes the first elements in the matrix 1 to the matrix 16, Q2 includes the second elements in the matrix 1 to the matrix 16, Q3 includes the third elements in the matrix 1 to the matrix 16, and Q4 includes the fourth elements in the matrix 1 to the matrix 16. Therefore, after Q1, Q2, Q3, and Q4 are obtained, the elements in the third matrix need to be reordered according to the preset reordering rule, to obtain the output feature map. For understanding of a reordering manner, refer to
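As a minimal sketch of one reordering consistent with this principle (assuming the sliding windows are traversed in row-major order, which is an assumption of this sketch): row w of each of Q1 to Q4 holds one element of the 2×2 output block of the w-th sliding window, and the four elements are scattered back to that window's position in the output feature map.

```python
import numpy as np

def reorder(Q1, Q2, Q3, Q4, out_h, out_w):
    """Q1..Q4: (m, n) with m = (out_h/2)*(out_w/2) sliding windows and n output channels."""
    m, n = Q1.shape
    cols = out_w // 2                              # sliding-window positions per output row
    out = np.zeros((out_h, out_w, n))
    for w in range(m):                             # scatter each window's 2x2 block back
        r, c = divmod(w, cols)
        out[2*r,     2*c,     :] = Q1[w]           # top-left element of the window's output
        out[2*r,     2*c + 1, :] = Q2[w]           # top-right
        out[2*r + 1, 2*c,     :] = Q3[w]           # bottom-left
        out[2*r + 1, 2*c + 1, :] = Q4[w]           # bottom-right
    return out

Q1, Q2, Q3, Q4 = (np.random.randn(12, 16) for _ in range(4))
print(reorder(Q1, Q2, Q3, Q4, out_h=6, out_w=8).shape)   # (6, 8, 16)
```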
The foregoing describes the accelerator provided in this embodiment of this application. In the solution provided in this application, the winograd algorithm can be applied to a convolutional neural network by using a conventional matrix operation module and a conventional vector operation module in the general convolutional neural network. For a convolutional layer or pooling layer whose size is 3×3 and whose stride is 1, a quantity of multiplication times can be greatly reduced, to improve performance and an energy efficiency ratio of the accelerator.
As mentioned above, the ith element in each transformed target matrix is extracted to form a matrix with m rows and k columns, and the matrix is a first matrix. To further improve the performance of the accelerator, a plurality of elements in each transformed target matrix may be extracted at a time, and a plurality of first matrices are output at a time. For example, the following provides descriptions with reference to several specific implementations.
A manner of performing forward winograd transform on each target matrix, to convert the target matrix into a transformed target matrix may be represented by using the following formula 2-2.
In the formula, m00=P00−P20−P02+P22, m10=P10+P20−P12−P22, m20=P20−P10−P22+P12, and m30=P10−P30−P12+P32. It can be learned that a first column and a third column of the target matrix are used for operations of m00, m10, m20, and m30. m01=P01−P21+P02−P22, m11=P11+P21+P12+P22, m21=P21−P11+P22−P12, and m31=P11−P31+P12−P32. It can be learned that a second column and the third column of the target matrix are used for operations of m01, m11, m21, and m31. m02=P02−P22−P01+P21, m12=P22+P12−P11−P21, m22=P22−P12−P21+P11, and m32=P12−P32−P11+P31. It can be learned that the second column and the third column of the target matrix are used for operations of m02, m12, m22, and m32. m03=P01−P21−P03+P23, m13=P11+P21−P13−P23, m23=P21−P11−P23+P13, and m33=P11−P31−P13+P33. It can be learned that the second column and a fourth column of the target matrix are used for operations of m03, m13, m23, and m33. With reference to
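As a cross-check, the element-wise expressions above are consistent with the commonly used F(2×2, 3×3) input-transform matrix BT; the following sketch is an illustration under that assumption, not the accelerator implementation, and computes the transformed target matrix as BT×P×B.

```python
import numpy as np

# Transformation matrix BT consistent with the element-wise expansion above
# (standard winograd F(2x2, 3x3) input transform); an illustrative assumption.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

P = np.arange(16, dtype=np.float32).reshape(4, 4)  # example 4x4 target matrix
m = BT @ P @ BT.T                                  # transformed target matrix

# Spot check against the expansion, e.g. m00 = P00 - P20 - P02 + P22.
assert np.isclose(m[0, 0], P[0, 0] - P[2, 0] - P[0, 2] + P[2, 2])
```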
The following provides descriptions with reference to several embodiments. It can be learned from the foregoing descriptions that there are overlapping (shared) terms in the process of calculating the elements in a transformed target matrix. For example, elements of a first column and elements of a third column in a target matrix are used for calculating elements of a first column in the transformed target matrix. In this case, with reference to
To further improve the performance of the accelerator, a plurality of elements in each convolution kernel may be extracted at a time, and a plurality of second matrices are output at a time. There are also overlapping terms in the process of calculating the elements in a transformed convolution kernel. The following provides a description with reference to a formula 2-3.
q00=k′00, q10=(k′00+k′10+k′20)/2, q20=(k′00−k′10+k′20)/2, and q30=k′20. It can be learned that the first column of the convolution kernel is used for operations of q00, q10, q20, and q30. q01=(k′00+k′01+k′02)/2, q11=(k′00+k′01+k′02+k′10+k′11+k′12+k′20+k′21+k′22)/4, q21=(k′00+k′01+k′02−k′10−k′11−k′12+k′20+k′21+k′22)/4, and q31=(k′20+k′21+k′22)/2. It can be learned that each column of the convolution kernel is used for operations of q01, q11, q21, and q31. q02=(k′00−k′01+k′02)/2, q12=(k′00−k′01+k′02+k′10−k′11+k′12+k′20−k′21+k′22)/4, q22=(k′00−k′01+k′02−k′10+k′11−k′12+k′20−k′21+k′22)/4, and q32=(k′20−k′21+k′22)/2. It can be learned that each column of the convolution kernel is used for operations of q02, q12, q22, and q32. q03=k′02, q13=(k′02+k′12+k′22)/2, q23=(k′02−k′12+k′22)/2, and q33=k′22. It can be learned that the third column of the convolution kernel is used for operations of q03, q13, q23, and q33.
A manner of performing forward winograd transform on each convolution kernel to convert the convolution kernel into a transformed convolution kernel may be represented by using the formula 2-3. There are overlapping terms in the process of calculating the elements in a transformed convolution kernel. An operation may be performed through vector addition and subtraction between elements in a convolution kernel, to output a plurality of transformed convolution kernels, or output some elements in a plurality of transformed convolution kernels. To improve parallelism, each element may carry all or some of the input channel and output channel dimensions.
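A corresponding sketch for the convolution kernel transform, again an illustration assuming the commonly used F(2×2, 3×3) matrix G that is consistent with the expansion of formula 2-3 above:

```python
import numpy as np

# Transformation matrix G consistent with the element-wise expansion of formula 2-3
# (standard winograd F(2x2, 3x3) kernel transform); an illustrative assumption.
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

k = np.arange(9, dtype=np.float32).reshape(3, 3)  # example 3x3 convolution kernel
q = G @ k @ G.T                                   # 4x4 transformed convolution kernel

# Spot checks against the expansion, e.g. q10 = (k'00 + k'10 + k'20) / 2 and q33 = k'22.
assert np.isclose(q[1, 0], (k[0, 0] + k[1, 0] + k[2, 0]) / 2)
assert np.isclose(q[3, 3], k[2, 2])
```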
It should be noted that, depending on different bandwidth and storage requirements, when 16 first matrices or 16 second matrices are output, there may be a plurality of calculation orders.
In a possible implementation, to reduce a calculation amount of the matrix transform unit in the accelerator, a process of the second forward winograd transform may be performed offline. To be specific, the accelerator provided in this application further includes a storage module, the storage module is configured to store a result of the second forward winograd transform, and another module in the accelerator may directly invoke the result of the second forward winograd transform prestored in the storage module. In a possible implementation, a part of the process of the second forward winograd transform may alternatively be performed on a chip, and another part of the process of the second forward winograd transform may be performed offline. This is described below by using examples.
The second forward winograd transform includes third forward winograd transform and fourth forward winograd transform. The neural network accelerator further includes the storage module, and the storage module is configured to store a first transformation result of performing the third forward winograd transform on the convolution kernel by using the third matrix. The matrix transform unit is specifically configured to perform the fourth forward winograd transform on the first transformation result by using a fourth matrix, to obtain a transformed convolution kernel. The third matrix and the fourth matrix are matrices obtained after a transformation matrix of the second forward winograd transform is decomposed, a value of an element in the third matrix is 0 or ±1, and the fourth matrix is a matrix other than the third matrix in the matrices obtained after decomposition. The following uses an example for description. G×K×GT=V may be converted into a formula 2-4:
V = G×K×GT = GL×(GR×K×GRT)×GLT = GL×Wm×GLT (2-4)
The calculation of Wm=GR×K×GRT may be performed offline, and the result may be prestored in the storage module; GL×Wm×GLT may then be performed on a chip. The transformation matrix G of the second forward winograd transform is split into a 3×3 matrix GR (2-5) and a 4×3 matrix GL (2-6). It should be noted that there may be another splitting manner, provided that all elements in one of the transformation matrices obtained after splitting are 0 or ±1.
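As a purely illustrative sketch of this split, one possible decomposition is shown below. The matrices of formulas 2-5 and 2-6 are not reproduced here, so the GL and GR below are assumptions that merely satisfy GL×GR = G, with GL containing only 0 and ±1 so that the on-chip part needs no multiplications.

```python
import numpy as np

G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

# One possible split G = GL x GR (an assumption; formulas 2-5 and 2-6 are not
# reproduced here). GL contains only 0 and +/-1, so GL x Wm x GL^T needs only
# additions and subtractions; GR carries the fractional coefficients offline.
GL = np.array([[ 1, 0, 0],
               [ 0, 1, 0],
               [ 0, 0, 1],
               [-1, 1, 1]], dtype=np.float64)     # 4x3, values 0 or +/-1
GR = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5]])                 # 3x3
assert np.allclose(GL @ GR, G)

K = np.arange(9, dtype=np.float64).reshape(3, 3)  # example convolution kernel
Wm = GR @ K @ GR.T            # performed offline and prestored (formula 2-4)
V_on_chip = GL @ Wm @ GL.T    # performed on a chip
assert np.allclose(V_on_chip, G @ K @ G.T)
```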
To meet a requirement of a convolution operation of a fixed point number, the solution provided in this application may support de-quantization and quantization processing. In a possible implementation, the vector operation module may support de-quantization (de-quantization) and quantization (quantization) operations, to meet a requirement of an operation of a fixed point number. De-quantization may be used to convert a fixed point number into a floating point number or another fixed point number that facilitates an operation of the vector operation module, for example, s32->f16 and s32->s16. Quantization is used to convert a reordered result of the vector operation module into a fixed point number input of a next-layer operation, for example, s16->s8 and f16->s8. In a possible implementation, de-quantization may be performed before inverse winograd transform, and quantization may be performed after inverse winograd transform. Performing de-quantization before the inverse transform operation reduces the bit width of the data processed by the inverse transform, so that higher effective computing power can be achieved. It should be noted that specific manners of quantization and de-quantization are not limited in this embodiment of this application.
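A minimal sketch of this de-quantization/quantization flow, assuming a simple per-tensor scale (the actual quantization scheme is not limited by this embodiment, so the scales and functions below are hypothetical):

```python
import numpy as np

def dequantize(acc_s32, scale):
    # e.g. s32 -> f16: convert the fixed point accumulator into a floating point
    # value that the vector operation module can process.
    return (acc_s32.astype(np.float32) * scale).astype(np.float16)

def quantize(x_f16, scale, zero_point=0):
    # e.g. f16 -> s8: convert the reordered result into the fixed point input
    # of the next-layer operation.
    q = np.round(x_f16.astype(np.float32) / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

acc = np.array([[12000, -3400], [560, 70]], dtype=np.int32)  # example matrix result
x = dequantize(acc, scale=1e-3)   # before inverse winograd transform
# ... inverse winograd transform and reordering would happen here ...
y = quantize(x, scale=0.05)       # after inverse winograd transform
```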
In a possible implementation, an offset operation may be performed on at least one multiplication result. In the solution provided in this application, performing an offset operation on a multiplication result may be equivalent to performing an offset operation on an output feature map. This is proved as follows:
In the foregoing formula, b represents an offset, and one value of c may be obtained according to the formula 2-7.
It can be learned that performing an offset operation on a fifth multiplication result may be equivalent to performing an offset operation on the output feature map.
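The equivalence can also be checked numerically. Assuming the commonly used F(2×2, 3×3) inverse-transform matrix AT (an assumption, since formula 2-7 is not reproduced here), the multiplication result at position (1, 1) of the 4×4 tile contributes with coefficient 1 to every element of the 2×2 output, so adding the offset b to that multiplication result adds b to the whole output tile.

```python
import numpy as np

# Inverse-transform matrix AT of winograd F(2x2, 3x3); an illustrative assumption.
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

M = np.arange(16, dtype=np.float64).reshape(4, 4)  # example multiplication-result tile
b = 0.25                                           # offset

out = AT @ M @ AT.T                # output without offset
M_b = M.copy()
M_b[1, 1] += b                     # add the offset to one multiplication result
out_b = AT @ M_b @ AT.T

# Adding b to that multiplication result equals adding b to the output feature map.
assert np.allclose(out_b, out + b)
```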
In a possible implementation, to reduce a calculation time of the accelerator, operations in the matrix transform unit and the vector operation module may be on-the-fly calculations. For example, a function of the matrix transform unit may be fixed into an instruction for invocation. The matrix transform unit may be included in a process of transferring data from an upper-layer memory to the matrix operation module, that is, the data stored in the upper-layer memory is processed while being transferred to the matrix operation module. A processing process may be understood with reference to an operation performed by the matrix transform unit. For another example, an offset operation, a de-quantization operation, or a part of inverse winograd transform of the vector operation module may be completed through on-the-fly calculation.
In a possible implementation, as shown in
An embodiment of this application further provides an acceleration method. The acceleration method may include the following steps: performing first forward winograd transform on a target matrix corresponding to an input feature map, to obtain a transformed target matrix; performing second forward winograd transform on a convolution kernel, to obtain a transformed convolution kernel; performing a matrix multiplication operation on a first matrix and a second matrix, to obtain a multiplication result, where the first matrix is constructed based on the transformed target matrix, and the second matrix is constructed based on the transformed convolution kernel; and performing inverse winograd transform on the multiplication result, to obtain an output feature map.
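For orientation only, the four method steps can be illustrated on a single 4×4 tile and a single channel, using the standard F(2×2, 3×3) transformation matrices as an assumption; the accelerator batches these steps over all tiles and channels as described above.

```python
import numpy as np

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float64)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

d = np.arange(16, dtype=np.float64).reshape(4, 4)  # 4x4 target matrix (input tile)
g = np.arange(9, dtype=np.float64).reshape(3, 3)   # 3x3 convolution kernel

U = BT @ d @ BT.T          # first forward winograd transform (target matrix)
V = G @ g @ G.T            # second forward winograd transform (convolution kernel)
M = U * V                  # element-wise products; realized as matrix multiplications
                           # when batched over tiles (m x k) and channels (k x n)
Y = AT @ M @ AT.T          # inverse winograd transform -> 2x2 output tile

# Reference: direct 3x3 convolution with stride 1 on the same tile.
ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)] for i in range(2)])
assert np.allclose(Y, ref)
```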
In a possible implementation, the method further includes: performing a padding (padding) operation on the input feature map, so that a size of the input feature map is W×H×k, where W and H each are an even number not less than 4, k is an integer greater than 1, W is a quantity of rows of the input feature map, H is a quantity of columns of the input feature map, and k is a quantity of channels of the input feature map. The input feature map is traversed by using a sliding window whose stride is 2 and whose size is 4×4, to obtain (((W−2)(H−2)/4)×k) target matrices.
In a possible implementation, a size of the convolution kernel is 3×3×k×n, a stride of the convolution kernel is 1, n is a quantity of channels of the output feature map, and n is an integer greater than 1.
In a possible implementation, the first matrix includes an ith element in each transformed target matrix, i is a positive integer not greater than 16, the first matrix is a matrix with m rows and k columns, and m is equal to ((W−2)(H−2)/4). The second matrix includes an ith element in each transformed convolution kernel, and the second matrix is a matrix with k rows and n columns. The multiplication result is used to determine the output feature map.
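For illustration, the construction of the first and second matrices for one value of i could look as follows. The storage layout assumed here (transformed target matrices as an array of shape (m, k, 4, 4) and transformed convolution kernels as (k, n, 4, 4)) is hypothetical.

```python
import numpy as np

def build_and_multiply(transformed_targets, transformed_kernels, i):
    """transformed_targets: shape (m, k, 4, 4), one 4x4 transformed target matrix
    per sliding-window position and input channel.
    transformed_kernels: shape (k, n, 4, 4), one 4x4 transformed convolution
    kernel per input/output channel pair.
    i: element index in 1..16 (row-major position inside the 4x4 matrices).
    Returns the multiplication result S_i of shape (m, n)."""
    r, c = divmod(i - 1, 4)
    first_matrix = transformed_targets[:, :, r, c]   # m rows, k columns
    second_matrix = transformed_kernels[:, :, r, c]  # k rows, n columns
    return first_matrix @ second_matrix              # m x n multiplication result

# Example sizes: m sliding windows, k input channels, n output channels.
m, k, n = 6, 3, 2
rng = np.random.default_rng(1)
targets = rng.standard_normal((m, k, 4, 4))
kernels = rng.standard_normal((k, n, 4, 4))
S = [build_and_multiply(targets, kernels, i) for i in range(1, 17)]  # S1..S16
```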
In a possible implementation, the performing inverse winograd transform on the multiplication result, to obtain an output feature map includes: performing the inverse winograd transform on the multiplication result to obtain a third matrix; and reordering elements in the third matrix by using a preset reordering rule, to obtain the output feature map.
In a possible implementation, the performing inverse winograd transform on the multiplication result, to obtain an output feature map includes: performing the inverse winograd transform on the multiplication result to output a third matrix; and performing a summation operation on elements in the third matrix, to obtain the output feature map.
In a possible implementation, the second forward winograd transform includes third forward winograd transform and fourth forward winograd transform, and the performing second forward winograd transform on a convolution kernel whose size is 3×3×k×n and whose stride is 1, to obtain a transformed convolution kernel includes: performing the third forward winograd transform on the convolution kernel by using the third matrix, to obtain a first transformation result; and performing the fourth forward winograd transform on the first transformation result by using a fourth matrix, to obtain the transformed convolution kernel, where the third matrix and the fourth matrix are matrices obtained after a transformation matrix of the second forward winograd transform is decomposed, a value of an element in the third matrix is 0 or ±1, and the fourth matrix is a matrix other than the third matrix in the matrices obtained after decomposition.
In a possible implementation, the method further includes: obtaining M elements of a plurality of transformed target matrices, where M is an integer greater than 1; processing the M elements according to a first preset formula, to output a plurality of first matrices; obtaining N elements of a plurality of transformed convolution kernels, where N is an integer greater than 1; and processing the N elements according to a second preset formula, to output a plurality of second matrices.
In a possible implementation, the method further includes: performing an offset operation on a multiplication result.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a program used for acceleration. When the program is run on a computer, the computer is enabled to perform the steps performed by the neural network accelerator described in the embodiments shown in
The neural network accelerator in this application may also be implemented by using a digital processing chip or a chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, the program instructions are executed by the processing unit, and the processing unit is configured to perform the method steps performed by the neural network accelerator shown in any embodiment in
An embodiment of this application further provides a digital processing chip. The digital processing chip implements, based on program code stored in an external memory, the actions performed by the neural network accelerator in the foregoing embodiments.
An embodiment of this application further provides a computer program product. When the computer program product runs on a computer, the computer is enabled to perform the steps performed by the neural network accelerator in the methods described in the embodiments shown in
The neural network accelerator provided in this embodiment of this application may be a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, to enable a chip in a server to perform the steps performed by the neural network accelerator described in the embodiments shown in
Specifically, the processing unit or the processor may be a central processing unit (central processing unit, CPU), a neural-network processing unit (neural-network processing unit, NPU), a graphics processing unit (graphics processing unit, GPU), a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application-specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA), another programmable logic device, a discrete gate, a transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, any regular processor, or the like.
Steps specifically performed by the matrix operation module 302 may be understood with reference to the steps performed by the matrix operation module 302 described in any embodiment in
The chip further includes a preprocessing module 301. Specific steps performed by the preprocessing module may be understood with reference to the steps performed by the preprocessing module described in any embodiment in
A bus interface unit (bus interface unit, BIU) 310 is used for interaction between an AXI bus and a DMAC and between the AXI bus and an instruction fetch buffer (Instruction Fetch Buffer, IFB) 309.
The bus interface unit (bus interface unit, BIU) 310 is used by the instruction fetch buffer 309 to obtain instructions from an external memory, and is further used by a storage unit access controller 306 to obtain original data of an input matrix A or a weight matrix B from the external memory.
Steps specifically performed by a vector operation module 303 may be understood with reference to the steps performed by the vector operation module 303 described in any embodiment in
In some implementations, the vector operation module 303 can store a processed output vector in a unified memory 307. For example, the vector operation module 303 may apply a linear function and/or a non-linear function to an output of the matrix operation module 302, for example, perform linear interpolation on a feature plane extracted at a convolutional layer, or accumulate a vector of values to generate an activation value. In some implementations, the vector operation module 303 generates a normalized value, a pixel-level summation value, or both. In some implementations, the processed output vector can be used as an activation input of the matrix operation module 302, for example, for use at a subsequent layer in a neural network.
The instruction fetch buffer (instruction fetch buffer) 309 connected to the controller 308 is configured to store an instruction used by the controller 308.
The unified memory 307, an input memory 305, a weight memory 304, and the instruction fetch buffer 309 each are an on-chip memory. The external memory is private to the hardware architecture of the NPU.
An operation at each layer in a recurrent neural network may be performed by the matrix operation module 302 or the vector operation module 303.
Any processor mentioned above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits configured to control program execution of the methods in
A data stream is as follows: data, which may include an input feature map and a weight, is obtained from the external memory by using the bus interface unit 310, and the obtained data is stored in the unified memory. The storage unit access controller controls the unified memory, so that data in the unified memory is transmitted to the matrix transform unit, data output by the matrix transform unit is transmitted to the weight memory 304 and the input memory, the weight memory 304 and the input memory output data to the matrix operation module, data output by the matrix operation module is transmitted to the vector operation module, an output result of the vector operation module is stored in the unified memory, and the result can be output to an external bus.
In addition, it should be noted that the described apparatus embodiments are merely examples. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all the modules may be selected according to actual needs to achieve the objectives of the solutions of embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided by this application, connection relationships between modules indicate that the modules have communication connections with each other, which may be specifically implemented as one or more communications buses or signal cables.
Based on the description of the foregoing implementations, a person skilled in the art may clearly understand that this application may be implemented by software in addition to necessary universal hardware, or by dedicated hardware, including a dedicated integrated circuit, a dedicated CPU, a dedicated memory, a dedicated component, and the like. Generally, any functions that can be performed by a computer program can be easily implemented by using corresponding hardware. Moreover, a specific hardware structure used to achieve a same function may be in various forms, for example, in a form of an analog circuit, a digital circuit, or a dedicated circuit. However, as for this application, software program implementation is a better implementation in most cases. Based on such an understanding, the technical solutions of this application essentially or the part contributing to the conventional technology may be implemented in a form of a software product. The computer software product is stored in a readable storage medium, for example, a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disc of a computer, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in embodiments of this application.
All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When software is used to implement the embodiments, all or a part of the embodiments may be implemented in a form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the procedure or functions according to embodiments of this application are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable apparatuses. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a web site, computer, server, or data center to another web site, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, for example, a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid-state disk (solid-state disk, SSD)), or the like.
This application is a continuation of International Application No. PCT/CN2020/118832, filed on Sep. 29, 2020, the disclosure of which is hereby incorporated by reference in its entirety.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/CN2020/118832 | Sep 2020 | US |
| Child | 18191134 | | US |