The present invention relates to an apparatus for processing a neural network.
A processing flow for a typical Convolutional Neural Network (CNN) is presented in FIG. 1. Typically, the input to the CNN is at least one 2D image/map 10 corresponding to a region of interest (ROI) from an image. The image/map(s) can comprise image intensity values only, for example, the Y plane from a YCC image; or the image/map(s) can comprise any combination of colour planes from an image; or alternatively or in addition, the image/map(s) can contain values derived from the image such as a Histogram of Gradients (HOG) map as described in PCT Application No. PCT/EP2015/073058 (Ref: FN-398), the disclosure of which is incorporated by reference, or an Integral Image map.
CNN processing comprises two stages:
CNN feature extraction 12 typically comprises a number of processing layers 1 . . . N, where:
2-D or 3-D convolution kernels have A×B or A×B×C values or weights respectively, pre-calculated during a training phase of the CNN. Input map pixel values are combined with the convolution kernel values using a dot product function. After the dot product is calculated, an activation function is applied to provide the output pixel value. The activation function can comprise a simple division, as normally done for convolution, or a more complex function such as a sigmoid function, a rectified linear unit (ReLU) activation function or a PReLU (Parametric ReLU) function, as typically used in neural networks.
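The dot product and activation step described above can be sketched as follows. This is a minimal illustration only, assuming an unpadded 2-D convolution over a single input map; the function names (`convolve2d`, `relu`, `prelu`, `sigmoid`) and the PReLU slope of 0.25 are hypothetical choices made for the example and are not taken from the disclosure.

```python
import math


def relu(x):
    """Rectified linear unit: negative sums are clamped to zero."""
    return max(0.0, x)


def prelu(x, slope=0.25):
    """Parametric ReLU; the slope would normally be learned (0.25 is assumed here)."""
    return x if x > 0 else slope * x


def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def convolve2d(image, kernel, activation):
    """Unpadded 2-D convolution: dot product of each window with the kernel,
    followed by the activation function, giving one output pixel per window."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = sum(image[r + i][c + j] * kernel[i][j]
                      for i in range(kh) for j in range(kw))
            row.append(activation(acc))
        output.append(row)
    return output
```

For instance, `convolve2d(image, kernel, lambda s: s / 9)` would correspond to the simple division mentioned above for a 3×3 averaging kernel, while passing `relu`, `prelu` or `sigmoid` gives the neural network style activations.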
The layers involved in CNN feature classification 14 are typically as follows:
The CNN is trained to classify the input ROI into one or more classes or to detect an object within an image. For example, for a ROI potentially containing a face, a CNN might be used to determine if the face belongs to an adult or a child; if the face is smiling, blinking or frowning. For a ROI potentially containing a body, the CNN might be used to determine a pose for the body.
Once the structure of the CNN is determined, i.e. the input maps; the number of convolution layers; the number of output maps; the size of the convolution kernels; the degree of subsampling; the number of fully connected layers; and the extent of their vectors, the weights to be used within the convolution layer kernels and the fully connected layers used for feature classification are determined by training against a sample data set containing positive and negative labelled instances of a given class, for example, faces labelled as smiling and regions of interest containing non-smiling faces. Suitable platforms for facilitating the training of a CNN are available from: PyLearn which is based on Theano and MatConvNet which is in turn based on Caffe; Torch; or TensorFlow. It will nonetheless be appreciated that the structure chosen for training may need to be iteratively adjusted to optimize the classification provided by the CNN.
PCT Application WO 2017/129325 (Ref: FN-481-PCT) and PCT Application No. PCT/EP2018/071046 (Ref: FN-618-PCT), the disclosures of which are herein incorporated by reference, disclose CNN Engines providing a platform for processing layers of a neural network. Image information is acquired across a system bus and the image scanned with pixels of the input image being used to generate output map pixels. The output map pixels are then used by the CNN Engine as inputs for successive layers of the network. In each of these cases, the CNN Engine comprises a limited amount of on-board cache memory enabling input image and output map pixels to be stored locally rather than having to be repeatedly read and written across the system bus.
In order to minimize the amount of on-board memory required by the CNN Engine, processing of an input image can be broken down into tiles, such as disclosed at: https://computer-vision-talks.com/tile-based-image-processing/.
FIG. 2 shows an exemplary portion of a convolutional neural network layer structure which it may be desired to process. As will be seen, an input image tile is fed through 4 successive convolution and pooling layers before the resulting feature vector is fed to a fully connected network. A 64×64 pixel input image tile fed through four 3×3 convolutional layers, each followed by a 2×2 pooling layer, produces 3×3 pixels of information for a feature vector which can then be fed to a fully connected network. (Note that in this case, an edge image tile is chosen and one of the pixels is designated as a padding pixel, so allowing 63 pixels of image information across the width of the tile to be used to generate 62 pixels of output map information across that width; for image tiles from the centre of an image, 64 pixels of input image information produce 62 pixels of output map information.)
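The map sizes quoted for this layer structure can be traced with a short calculation. The sketch below is illustrative only: it assumes unpadded 3×3 convolutions and 2×2 pooling that rounds odd dimensions up, which is one reading under which a 64 pixel wide tile yields a 3 pixel wide map after the fourth pooling layer; the rounding behaviour and the helper name `trace_sizes` are assumptions, not details stated in the text.

```python
import math


def trace_sizes(size, layers=4, kernel=3, pool=2):
    """Return the map width after each convolution and each pooling step.

    Assumes unpadded convolution (width shrinks by kernel - 1) and pooling
    that rounds up when the width is not a multiple of the pooling factor.
    """
    sizes = [size]
    for _ in range(layers):
        size = size - kernel + 1          # 3x3 convolution without padding
        sizes.append(size)
        size = math.ceil(size / pool)     # 2x2 pooling, rounding up (assumed)
        sizes.append(size)
    return sizes


# Widths for a 64 pixel wide tile:
# [64, 62, 31, 29, 15, 13, 7, 5, 3]
print(trace_sizes(64))
```

Under these assumptions the first convolution reduces 64 pixels to 62, matching the figure given for tiles from the centre of an image.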
FIG. 3 shows the processing of pixels from such a tile graphically. It can be seen that, because of the effect of padding and especially pooling, the value of an output map pixel 35 is affected by input image tile pixels laterally offset from the output pixel location. For example, input image pixel 33, which is 32 pixels offset from output map pixel 35, is a factor in calculating the value for output map pixel 35. Note that the same applies in the vertical offset direction.
Referring now to FIG. 4, in order to ensure that input image tile pixel information is available for any given output map pixel location from a given tile, input image tiles are typically provided for processing to a processing engine on an overlapping basis.
Thus, in the present example, for a 256×256 pixel image divided into 64×64 pixel tiles, 49 tiles will need to be provided to a CNN engine with each tile overlapping a previous tile by 32 pixels (50%). Clearly, this involves significant overhead when reading image information from system memory to the CNN engine.
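The tile count in this example follows directly from the tile size and overlap, and the same arithmetic shows the read overhead. The following is a small sketch of that calculation, assuming square images and tiles that fit the image exactly; the function names `tiles_per_axis` and `tiling_cost` are introduced here for illustration only.

```python
def tiles_per_axis(image_size, tile_size, overlap):
    """Number of tiles along one axis when each tile overlaps the previous
    one by `overlap` pixels (assumes the tiles fit the image exactly)."""
    stride = tile_size - overlap
    return (image_size - tile_size) // stride + 1


def tiling_cost(image_size, tile_size, overlap):
    """Return (tile count, pixels read relative to the image size)."""
    n = tiles_per_axis(image_size, tile_size, overlap)
    tiles = n * n
    pixels_read = tiles * tile_size * tile_size
    return tiles, pixels_read / (image_size * image_size)


# 256x256 image, 64x64 tiles, 32 pixel overlap: 7 tiles per axis, 49 in total,
# so each image pixel is read about three times on average.
print(tiling_cost(256, 64, 32))   # (49, 3.0625)
# With no overlap the same image needs only 16 tiles and is read once.
print(tiling_cost(256, 64, 0))    # (16, 1.0)
```

The gap between the two results is the bandwidth overhead that a reduced overlap is intended to recover.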
U.S. Pat. No. 7,737,985 discloses graphics circuitry including a cache separate from the device memory, to hold data, including buffered sub-image cell values. The cache is connected to the graphics circuitry so that pixel processing portions of the graphics circuitry access the buffered sub-image cell values in the cache, in lieu of the pixel processing portions directly accessing the sub-image cell values in the device memory. A write operator writes the buffered sub-image cell values to the device memory under direction of a priority scheme. The priority scheme preserves in the cache border cell values bordering one or more primitive objects.
It is an object of the present invention to provide an apparatus for processing a neural network which does not have the same bandwidth requirement when reading image information from system memory.
According to the present invention, there is provided an apparatus for processing a neural network according to claim 1.
An embodiment of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Referring now to
Typically, such systems comprise a central processor (CPU) 30 which communicates with other peripherals within the system, for example, one or more cameras (not shown) to acquire images for processing and to store these in system memory 40 before coordinating with the CNN Engine 50 to process the images.
Within a core of the CNN Engine 50, there is provided a cache 60 in which both input tile and output map information for a given tile are stored in a portion 70 of the cache.
A controller 90, usually triggered by the CPU 30, enables the CNN Engine 50 to read an image, tile-by-tile from system memory 40 and to write this into the cache 60.
Weight information can also be stored in a cache. As disclosed in PCT Application WO 2017/129325 (Ref: FN-481-PCT) and PCT Application No. PCT/EP2018/071046 (Ref: FN-618-PCT), this can be in a separate memory from that storing the input tile and output map information, or it can be stored in a different portion 80 of the memory 60 from the input tile and output map information 70.
In one implementation, the controller 90 enables the CNN Engine 50 to acquire the required weights information from system memory 40 for processing the required layers of a network. In other embodiments, the weights information can be pre-stored within the CNN Engine 50 so avoiding the need to read this across the system bus.
The controller 90 is also able to obtain network configuration information so that it knows the parameters and weights for each layer of the neural network to be processed. Again, this can be pre-stored within the CNN Engine 50 or the CPU 30 can provide the configuration information to the CNN Engine 50 at run-time.
For any given layer of a CNN to be processed, once the input image tile or input map (an output map generated from a previous layer of the network) and the required weights are available, these are fed to a layer processing engine 95 where the input information is combined with the weight information to generate output map information which is written back into memory 70 within the cache 60.
Unlike in conventional processing, however, the layer processing engine 95 uses not only input image tile information from a given tile stored in memory 70, but also certain limited information stored from the processing of previous input image tiles, in order to reduce the overlap required between input image tiles read from system memory 40.
Referring to
While this is also true of some output map values from previous layers of the network, it will be seen that the number of input pixels 65 affecting the values 75(a), 75(b) is relatively larger and more offset than the input pixels affecting output map values for earlier layers of the network including after pooling layers 1 and 2.
Embodiments of the present invention leverage this fact by storing a limited number of output map pixels for a layer after at least one pooling step in a convolutional neural network in memory for use in processing subsequent tiles. Preferably, certain output map pixels for a layer after two pooling steps of the convolutional neural network are stored and most preferably certain output map pixels for a layer after three pooling steps of the convolutional neural network are stored.
Note, however, that the pixels from one tile which are to be stored for processing of a subsequent tile are not drawn from the output map after the last pooling layer, in this case pooling layer 4, as these pixels do not lie in the overlap region between tiles; if they did, the overlap between tiles would be so great, or the tiles so large, that the benefits of the present approach would be diminished.
In the embodiment, it is assumed that tiles are acquired from a top corner of an image left-to-right, row-by-row, but it will be appreciated that tiles can be read in reverse or flipped order and the embodiment adjusted accordingly.
As shown in
While
In the embodiment, these values are stored in a portion 72 of cache memory 60, although it will be appreciated that values can also be stored in a separate dedicated memory from memory 60.
In any case, referring back to
This means that when determining an output map value 76 after convolution layer 4 for tile Tn, the layer processing engine 95 will use a combination of pre-stored values 75(a), 75(b) generated during the processing of previous tile Tx as well as a value 75(c) obtained from the output map after pooling layer 3 when processing current tile Tn.
The value 75(a) essentially acts as a proxy for the map information contained in the pixels 65 and so allows the overlap between image tiles to be reduced from the 32 pixels shown in
Correspondingly, during later processing of pooling layer 3, output map pixels 85(a) and 85(b) adjacent the right boundary of the tile are not only written back to memory 70, but are also stored in memory 72. Thus, when processing subsequent tile Tz, output map values 85(a) and 85(b) can be retrieved from memory 72 and used in conjunction with output map value 85(c), which is calculated only during the processing of tile Tz, to generate output map value 86.
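The reuse of stored boundary values can be illustrated with a simplified one-dimensional sketch. It is not the circuitry of the embodiment: it assumes a single convolution and pooling stage followed by a second 3-wide convolution, processes tiles left to right only, and uses a Python list named `boundary` to play the role of memory 72. All function names and the two-stage structure are assumptions made for the illustration. The point shown is that keeping the last two pooled values of each tile lets the next tile complete its convolution windows without re-reading the wider input overlap that would otherwise be needed to recompute them.

```python
def conv3(x, w):
    """Unpadded 3-wide convolution (dot product of each window with w)."""
    return [w[0] * x[i] + w[1] * x[i + 1] + w[2] * x[i + 2]
            for i in range(len(x) - 2)]


def pool2(x):
    """2-wide max pooling over consecutive pairs."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]


def stage1(x, w):
    """First convolution followed by pooling."""
    return pool2(conv3(x, w))


def reference(signal, w1, w2):
    """Process the whole signal at once, with no tiling."""
    return conv3(stage1(signal, w1), w2)


def tiled(signal, w1, w2, outputs_per_tile=4):
    """Process the signal tile by tile, carrying two pooled boundary values
    from each tile to the next instead of recomputing them."""
    total = (len(signal) - 2) // 2      # pooled values over the whole signal
    boundary = []                       # stands in for memory 72
    result = []
    for a in range(0, total, outputs_per_tile):
        b = min(a + outputs_per_tile, total)
        tile = signal[2 * a: 2 * b + 2]  # input tiles overlap by only 2 pixels
        pooled = stage1(tile, w1)        # pooled values a .. b-1 for this tile
        extended = boundary + pooled     # stored values plus current values
        result.extend(conv3(extended, w2))
        boundary = extended[-2:]         # keep the right-hand boundary values
    return result
```

Calling `tiled(signal, w1, w2)` and `reference(signal, w1, w2)` on the same integer data should return identical lists. Without the stored boundary values, each tile in this sketch would have to overlap its predecessor by 6 input pixels rather than 2 in order to recompute the two missing pooled values.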
Turning again to
Note that once the processing of convolution layers is complete, the CNN Engine 50 can continue by generating a feature vector comprising a combination of the output maps generated from the processing of the image tiles in an otherwise conventional fashion to produce the required classification of an image and this is not described in further detail here.
The present approach enables the overlap between tiles vis-à-vis the example shown in
It will be appreciated that when using 64×64 pixel tiles, where the CNN Engine 50 is dedicated to processing a given neural network schema, the switching involved in selecting and storing boundary pixels from/in memory 72 can be hardwired using simple shifting and multiplexing circuitry and can be done with no processing delays.
On the other hand, the CNN Engine 50 could be implemented as an extension of the CNN engines disclosed in PCT Application WO 2017/129325 (Ref: FN-481-PCT) and PCT Application No. PCT/EP2018/071046 (Ref: FN-618-PCT) where the controller 90 is configurable to select for which layer, or possibly layers, boundary pixels are to be stored in a separate memory for use in processing subsequent tiles.
Number | Name | Date | Kind |
---|---|---|---|
7737985 | Torzewski et al. | Jun 2010 | B2 |
10205950 | Terada | Feb 2019 | B2 |
10810491 | Xia | Oct 2020 | B1 |
20050134588 | Aila | Jun 2005 | A1 |
20130107973 | Wang | May 2013 | A1 |
20160219298 | Li | Jul 2016 | A1 |
20160364644 | Brothers | Dec 2016 | A1 |
20180150721 | Mostafa | May 2018 | A1 |
20180250744 | Symeonidis | Sep 2018 | A1 |
20190171930 | Lee | Jun 2019 | A1 |
20190197083 | Chen | Jun 2019 | A1 |
20190220734 | Ferdman | Jul 2019 | A1 |
20190220742 | Kuo | Jul 2019 | A1 |
20190243755 | Luo | Aug 2019 | A1 |
20200159809 | Catthoor | May 2020 | A1 |
20200293813 | Shibata | Sep 2020 | A1 |
20200410633 | Pieters | Dec 2020 | A1 |

Number | Date | Country |
---|---|---|
2016083002 | Jun 2016 | WO |
2017129325 | Dec 2016 | WO |
2019042703 | Mar 2019 | WO |

Entry |
---|
Ievgen Khvedchenia, “Tile-based image processing”, www.computer-vision-talks.com/tile-based-image-processing/, 2015. |

Number | Date | Country |
---|---|---|
20200057919 A1 | Feb 2020 | US |

Number | Date | Country |
---|---|---|
62719605 | Aug 2018 | US |