The present disclosure relates to convolutional neural networks. In particular, the present disclosure relates to a method and apparatus of reducing computational complexity of convolutional neural networks.
A convolutional neural network (CNN) system is a type of feed-forward artificial neural network that has many applications. CNN systems have evolved to be state-of-the-art in the field of machine learning, for example, in object detection, image classification, scene segmentation, image quality improvement such as super resolution, and disparity estimation.
CNN systems generally include multiple layers of convolutional filters (also referred to as “weight kernels” or just “kernels”). Each convolution layer may receive a feature map as input, which is convolved with a kernel to generate a convolved output. Due to the large number of feature maps that may require processing at each layer, the large kernel sizes, and an increasing number of layers in deep neural networks, training and running CNN systems are generally computationally expensive. The complexity also increases with a larger input size (e.g., a full high definition (HD) image), which translates into a larger width and height of input feature maps, and all intermediate feature maps.
Many applications such as pedestrian detection require fast real-time processing. Current hardware architectures and graphics processing units (GPUs) aim at parallel processing on multiple processing units to speed up the computation. However, due to the recent trend of implementing deep CNN systems on power-limited electronic devices such as mobile devices, it is desirable to reduce the computational burden in order to lower power consumption and speed up processing time.
Disclosed herein is a convolutional neural network (CNN) system for generating a classification for an input image received by the CNN system. In one aspect, the CNN system comprises circuitry running on clock cycles, wherein the circuitry is configured to compute a product of two received values, and at least one non-transitory computer-readable medium that stores instructions for the circuitry to: derive a feature map based on at least the input image; and puncture at least one selection among the feature map and a kernel by setting the value of an element at an index of the at least one selection to zero and cyclic shifting a puncture pattern to achieve a 1/d reduction in the number of clock cycles, where d is an integer puncture interval value greater than 1. The instructions are also for the circuitry to convolve the feature map with the kernel to generate a first convolved output, store the first convolved output in a register, and generate the classification for the input image based on at least the first convolved output.
Further disclosed herein is a computer-implemented method of generating a classification for an input image, wherein the method is performed by a convolutional neural network (CNN) system implemented by one or more computers, and wherein the CNN system includes a sequence of neural network layers. The computer-implemented method comprises: causing circuitry to execute instructions stored in at least one non-transitory computer-readable medium, the instructions being for deriving, by the neural network layers, a feature map based on at least the input image; and puncturing, by the neural network layers, at least one selection among the feature map and a kernel by setting the value of an element at an index of the at least one selection to zero and cyclic shifting a puncture pattern to achieve a 1/d reduction in the number of clock cycles, where d is an integer puncture interval value greater than 1. The instructions also cause the circuitry to convolve, by the neural network layers, the feature map with the kernel to generate a first convolved output, to store the first convolved output in a register, and to generate, by the neural network layers, the classification for the input image based on at least the first convolved output.
The accompanying drawings, which are included as part of the present disclosure, illustrate various embodiments and together with the general description given above and the detailed description of the various embodiments given below serve to explain and teach the principles described herein.
The figures in the drawings are not necessarily drawn to scale and elements of similar structures or functions are generally represented by like reference numerals for illustrative purposes throughout the figures. The figures are only intended to facilitate the description of the various embodiments described herein and do not describe every aspect of the teachings disclosed herein and do not limit the scope of the claims.
Each of the features and teachings disclosed herein may be utilized separately or in conjunction with other features and teachings to provide the present system and method. Representative examples utilizing many of these features and teachings, both separately and in combination, are described with reference to the attached figures. While the detailed description herein illustrates to a person of ordinary skill in the art further details for practicing aspects of the present teachings, it does not limit the scope of the claims. Therefore, combinations of features disclosed in the detailed description are representative examples of the present teachings and may not be necessary to practice the teachings in the broadest sense.
As discussed earlier, training and running CNN systems are typically computationally expensive.
The convolution of the C input feature maps X_c with the corresponding kernels K_{o,c} produces the oth output feature map, which may be expressed as

Y_o = Σ_{c=1}^{C} X_c * K_{o,c},

and implemented as follows:
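By way of illustration, a minimal Python sketch of this computation as nested multiply-accumulate loops is given below; the (channel, width, height) array layout, the cross-correlation indexing X[c, w+x, h+y], and the valid-region boundary handling (stride 1, no padding) are assumptions of the sketch rather than requirements of the disclosure.

```python
import numpy as np

def conv_layer(X, K):
    """Direct convolution Y[o] = sum_c X[c] * K[o, c] as nested MAC loops.

    X: input feature maps, shape (C, W_in, H_in)
    K: kernels, shape (M, C, R, S)
    Returns Y with shape (M, W, H), where W = W_in - R + 1 and H = H_in - S + 1
    (valid region, stride 1, no padding -- assumptions of this sketch).
    """
    C, W_in, H_in = X.shape
    M, _, R, S = K.shape
    W, H = W_in - R + 1, H_in - S + 1
    Y = np.zeros((M, W, H))
    for o in range(M):                      # each output channel
        for w in range(W):
            for h in range(H):
                acc = 0.0                   # accumulator register of the MAC unit
                for c in range(C):          # each input channel
                    for x in range(R):
                        for y in range(S):
                            acc += X[c, w + x, h + y] * K[o, c, x, y]
                Y[o, w, h] = acc
    return Y
```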
The convolution may be performed by repetitive use of an array of multiply accumulate (MAC) units. A MAC is a normal sequential circuit that computes the product of two received values, and accumulates the result in a register.
According to the above implementation, for each output channel o, each input channel c, and each element (w, h, o) in the output Y, a total of R×S multiplications is required, making the number of multiplications needed for each output channel C×W×H×R×S. Further, since each multiplication is followed by an accumulation, the number of MAC operations needed by the standard algorithm for all M output channels is equal to M×C×W×H×R×S, which may be quite substantial depending on the values of the dimensions. Thus, the present system and method is directed towards reducing the computational cost and complexity of convolution operations in a CNN system by puncturing (i.e., setting to zero) elements of the feature maps, the kernels, or both, thereby skipping certain computations.
In other words, the present system and method exploits the redundancy in the kernels and/or the feature maps to reduce computational complexity for hardware implementation, which allows some MAC operations to be skipped. Skipping a MAC operation is equivalent to having one of the operands in the multiplication be zero. To illustrate, consider a value (e.g., a pixel) in an input feature map as a first operand and a weight element in a kernel as a second operand. According to example embodiments of the present system and method, there are at least three approaches to reduce the computational complexity: (1) puncture the input feature map by overwriting some values to zero; (2) puncture the kernel by overwriting some values to zero; or (3) puncture both the input feature map and the kernel by overwriting some values of each to zero. Regular puncturing of feature maps is similar to subsampling, which helps avoid the loss of important features.
According to an embodiment, the present system and method provides regular puncturing of feature maps but not kernels to reduce computation and implementation complexity. The present system and method may further recover accuracy of the original network by fine-tuning of networks with regular puncturing of feature maps. According to another embodiment, the present system and method provides regular puncturing of kernels to reduce computation complexity. The present system and method may further recover accuracy of the original network by fine-tuning of networks with regular puncturing of kernels. According to another embodiment, the present system and method provides regular puncturing of both feature maps and kernels to reduce computation complexity. The present system may further recover accuracy of the original network by fine-tuning of networks with regular puncturing of both feature maps and kernels.
As
A convolved output (e.g., 203, 303, and 404) may be used as a feature map for input into a subsequent neural network layer or may be used to derive a feature map for input into a subsequent neural network layer. For example, in the case of
According to an embodiment of the present system and method, a CNN system may puncture the feature map and/or the kernel based on the index of the elements thereof. For example, if a list of indices in an input feature map that is punctured is represented by P_X and a list of indices in a kernel that is punctured is represented by P_K, the convolutional operation may be updated as follows:
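As a hedged sketch of the updated loop, the Python fragment below skips the MAC whenever the operand index belongs to P_X or P_K; the convention that P_X holds (c, w+x, h+y) feature-map indices and P_K holds (o, c, x, y) kernel indices is an assumption of the illustration, not of the disclosure.

```python
import numpy as np

def punctured_conv_layer(X, K, P_X=frozenset(), P_K=frozenset()):
    """Convolution in which MACs are skipped whenever the feature-map operand
    index is in P_X or the kernel operand index is in P_K, which is equivalent
    to puncturing (zeroing) those operands."""
    C, W_in, H_in = X.shape
    M, _, R, S = K.shape
    W, H = W_in - R + 1, H_in - S + 1
    Y = np.zeros((M, W, H))
    for o in range(M):
        for w in range(W):
            for h in range(H):
                acc = 0.0
                for c in range(C):
                    for x in range(R):
                        for y in range(S):
                            if (c, w + x, h + y) in P_X or (o, c, x, y) in P_K:
                                continue  # punctured operand: skip this MAC
                            acc += X[c, w + x, h + y] * K[o, c, x, y]
                Y[o, w, h] = acc
    return Y
```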
If P_X represents A % of the indices of the feature map, then around A % of multiplications may be saved by puncturing the feature map. Similarly, if P_K represents B % of the indices of the kernel, then another B % of the MACs may be saved by puncturing the kernel. However, for speed, the MAC operations may be performed in parallel, or in batch, where each processor may handle a set number of MACs. As such, a random puncturing of feature map and/or kernel may not provide actual reduction in computational complexity with batch processing.
For some convolutional neural networks, the kernels may be operated with a stride on the input feature map. This is equivalent to skipping some convolutional operations on the input feature maps. For example, with a stride of p, the algorithm may be modified as follows, and the complexity, i.e., the number of MAC operations, is reduced to M×C×W×H×R×S/p^2.
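As an illustration only, a compact stride-p variant of the earlier sketch is shown below, under the same layout assumptions; here W and H in the complexity expression refer to the un-strided output dimensions, so the MAC count becomes roughly M×C×(W/p)×(H/p)×R×S.

```python
import numpy as np

def strided_conv_layer(X, K, p):
    """Stride-p convolution: only every pth output position is computed, so the
    MAC count drops to about M*C*(W/p)*(H/p)*R*S = M*C*W*H*R*S/p**2."""
    C, W_in, H_in = X.shape
    M, _, R, S = K.shape
    W, H = (W_in - R) // p + 1, (H_in - S) // p + 1
    Y = np.zeros((M, W, H))
    for o in range(M):
        for w in range(W):
            for h in range(H):
                acc = 0.0
                for c in range(C):
                    for x in range(R):
                        for y in range(S):
                            acc += X[c, w * p + x, h * p + y] * K[o, c, x, y]
                Y[o, w, h] = acc
    return Y
```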
If the stride is greater than 1, then the computational complexity is reduced by a factor of p^2. In some cases, the stride p may be implemented by design, for example, to achieve a specific receptive field size and not necessarily to reduce complexity.
According to an embodiment, the present system and method provides regular puncturing of feature maps to reduce computational complexity, whether a stride is 1 or greater. Consider a feature map mask T, shown in
Similarly, regular puncturing of the oth kernel may be implemented as follows using a similar procedure:
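A minimal sketch of such regular kernel puncturing is given below, assuming the pattern zeroes weights at positions with ((x + y) mod d) − s = 0 and that the same two-dimensional pattern is reused for every input channel c (a channel-cyclic variant is discussed later); these choices are assumptions of the sketch.

```python
import numpy as np

def puncture_kernel(K_o, d, s=0):
    """Regularly puncture the oth kernel K_o (shape (C, R, S)): weights at
    positions with ((x + y) mod d) - s == 0 are overwritten with zero."""
    C, R, S = K_o.shape
    K_p = K_o.copy()
    for c in range(C):
        for x in range(R):
            for y in range(S):
                if ((x + y) % d) - s == 0:
                    K_p[c, x, y] = 0.0
    return K_p
```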
If the kernel size satisfies S≥d and R≥d, then for each stride of the kernel on the feature map, there is about a 1/d reduction in computational complexity. For example, consider S=R=d=4 as in the above example. The number of MACs required at each placement of the kernel is then 12×C rather than 16×C. Hence, the total number of MACs required for this convolutional layer is equal to (d−1)/d×M×C×W×H×R×S/p^2.
According to an embodiment, the same mask is applied to all C input feature maps. One form of parallel processing is to batch process the MACs at the same w,h location. The following implementation checks the mask T to determine whether a MAC operation should be performed for a particular location (w+x, h+y):
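One plausible form of this check, sketched for a single output channel o and batching the C MACs at each location (w+x, h+y), is shown below; the array layout follows the earlier sketches and is an assumption of the illustration.

```python
import numpy as np

def masked_conv_output_channel(X, K_o, T):
    """Mask-checked loop for one output channel: the C MACs at input location
    (w+x, h+y) are batched and performed only when T[w+x][h+y] is nonzero, so
    the same mask T is applied to all C input feature maps."""
    C, W_in, H_in = X.shape
    _, R, S = K_o.shape
    W, H = W_in - R + 1, H_in - S + 1
    Y_o = np.zeros((W, H))
    for w in range(W):
        for h in range(H):
            acc = 0.0
            for x in range(R):
                for y in range(S):
                    if T[w + x][h + y] == 0:    # punctured position: skip C MACs
                        continue
                    for c in range(C):          # batch of C MACs at this location
                        acc += X[c, w + x, h + y] * K_o[c, x, y]
            Y_o[w, h] = acc
    return Y_o
```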
By puncturing the feature map using a regular puncturing pattern, a processor of the CNN system is able to skip the C MAC operations every d increments in the width and height dimensions instead of checking if the position T[w+x][h+y] is masked or not. Hence, utilizing the regular puncturing pattern would result in an actual 1/d reduction in the required clock cycles with a parallel (batch) processor.
In some cases, by visual inspection or by experimentation or by measuring the sum of absolute values of a feature map, it may be determined that a certain feature map at each layer is too important to skip. In such a case, the masking may be entirely skipped for this feature map. After batch processing for C feature maps, the remaining MAC operations due to the non-masked feature maps may be processed and added to the sum at output position [w][h][o]. Such operations may be processed in parallel as well.
It is observed that there are d possible cyclic shifts of the mask T above for the convolutional layer under consideration that would achieve the same actual reduction in batch processing computational complexity. According to an embodiment, the present system and method tests the accuracy of the already trained CNN with all d different shifts, while all remaining layers remain the same. Then, the shift that achieves the best accuracy is chosen.
For example, let T(cyclic_idx) be the masking matrix T when all rows are cyclically shifted by cyclic_idx. According to an embodiment, the present system and method may select the best masking matrix at each layer in an iterative layer-by-layer manner as follows, where d_layer_idx is the value of d selected at the layer indexed by layer_idx.
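A sketch of this layer-by-layer selection is given below; evaluate_accuracy is a hypothetical callback standing in for whatever harness runs the already trained network with the mask T(cyclic_idx) applied at each layer and returns a validation accuracy, and the function and parameter names are illustrative assumptions.

```python
def select_cyclic_shifts(num_layers, d_per_layer, evaluate_accuracy, num_iterations=2):
    """Layer-by-layer search for the cyclic shift of each layer's mask T."""
    shifts = [0] * num_layers                      # start from shift 0 everywhere
    best_acc = evaluate_accuracy(shifts)
    for _ in range(num_iterations):                # stop early if accuracy stalls
        improved = False
        for layer_idx in range(num_layers):
            for cyclic_idx in range(d_per_layer[layer_idx]):
                trial = list(shifts)
                trial[layer_idx] = cyclic_idx      # try T(cyclic_idx) at this layer
                acc = evaluate_accuracy(trial)     # other layers keep their masks
                if acc > best_acc:
                    best_acc, shifts = acc, trial
                    improved = True
        if not improved:
            break
    return shifts, best_acc
```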
At the first iteration, the best cyclic shift of the first layer may not take into account any masking performed at the consecutive layers. However, in the subsequent iterations, the best cyclic shift of the first layer may take into account all the masks that have been selected for the other consecutive layers. The iteration may be stopped if the overall network accuracy is not improved.
There is generally a tradeoff in selecting the puncturing interval d. For example, a larger value of d implies less reduction in computational complexity but better accuracy due to less puncturing of the feature map. Different CNN layers may have different values of d, where d=1 implies no reduction in computational complexity. An iterative procedure, such as the one described above, may be followed to select an optimal d value, among an m number of different values of d, at each layer. That is, at each layer, m different masking matrices may be tested assuming their first cyclic shift. Then, once the d values (d_layer_idx) are selected for each layer, the best cyclic shift (e.g., the one that minimizes an error cost function) may be chosen for each layer by the above procedure.
Even after selection of the best cyclic shift at each layer, there may be loss in accuracy compared to the original network with no punctured feature maps. According to an embodiment, the present system and method redefines the network layers by adding a masking layer after each masked feature map to recover an original accuracy. The masking layer is similar to a pooling layer in that it may be used to reduce computational complexity. However, in contrast to a pooling layer, the masking layer does not reduce the feature map size; rather it performs a dot product of the feature map and/or kernel with the mask (e.g., mask T as previously discussed) and keeps the size of feature map and/or kernel the same, while overwriting some values of the feature map and/or kernel elements with zeros. The network may then be fine-tuned by backpropagation, in which gradients are calculated and used to modify the kernel values or weights towards reducing the error cost function or equivalently increase the objective accuracy.
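As a rough, framework-agnostic illustration of such a masking layer, the sketch below multiplies the feature map elementwise by a fixed binary mask T in the forward pass and masks the gradients in the backward pass, so that backpropagation-based fine-tuning leaves punctured positions at zero; the class name and array layout are illustrative assumptions.

```python
import numpy as np

class MaskingLayer:
    """Elementwise product with a fixed binary mask T; the feature-map size is
    unchanged, and punctured positions (and their gradients) stay zero."""

    def __init__(self, T):
        self.T = T                      # binary mask, same width/height as the feature map

    def forward(self, X):
        return X * self.T               # broadcasts the 2-D mask over all C channels

    def backward(self, grad_out):
        return grad_out * self.T        # gradients at punctured positions stay zero

# Example: mask with d=2, s=0 on a 4x4 feature map with C=3 channels.
T = np.array([[0.0 if ((w + h) % 2) == 0 else 1.0 for h in range(4)] for w in range(4)])
layer = MaskingLayer(T)
X = np.random.rand(3, 4, 4)
X_masked = layer.forward(X)             # punctured positions are zero; shape unchanged
```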
Depending on the shift s and the kernel size R×S, extra conditions on the stride p may be imposed to ensure that all kernel weights are activated. For example, if the feature map and the kernel have opposite puncturing patterns to one another with d=2, where one shift is 1 and the other shift is 2, and the convolutional stride is p=2, then the output from the convolution will always be zero. Thus, according to an embodiment, the present system and method may set different values for the puncturing interval d of the feature map and the kernel to ensure that the output from the convolution is not zero.
According to an embodiment of the present system and method, the batch convolutional processor for the oth output feature map can be modified to skip all MAC operations where the kernel operand is zero, as follows:
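A sketch of such a modified batch processor for the oth output feature map is shown below; the helper name and the cyclic_in_c option, which selects between the two pattern variants discussed next, are assumptions of the illustration.

```python
import numpy as np

def kernel_punctured_conv_output_channel(X, K_o, d, s=0, cyclic_in_c=False):
    """Batch processor for one output channel with a regular kernel puncturing
    pattern: MACs whose kernel operand falls on a punctured position are
    skipped. With cyclic_in_c=True the pattern is cyclic in the third (channel)
    dimension via mod(x + y + c, d); otherwise every channel reuses the same
    2-D pattern mod(x + y, d)."""
    C, W_in, H_in = X.shape
    _, R, S = K_o.shape
    W, H = W_in - R + 1, H_in - S + 1
    Y_o = np.zeros((W, H))
    for w in range(W):
        for h in range(H):
            acc = 0.0
            for x in range(R):
                for y in range(S):
                    for c in range(C):
                        phase = (x + y + c) if cyclic_in_c else (x + y)
                        if (phase % d) - s == 0:     # kernel operand punctured: skip
                            continue
                        acc += X[c, w + x, h + y] * K_o[c, x, y]
            Y_o[w, h] = acc
    return Y_o
```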
According to an embodiment, the regular puncturing pattern (also referred to as a weight pruning pattern) may be made cyclic in the third dimension by the condition ‘if mod(x+y+c,d)−s=0’, and a batch processor of the CNN system may take cyclic shifts in the third dimension into account. According to another embodiment, all kernels along the third dimension may use the same puncturing pattern as above, with the MAC performed only ‘if mod(x+y,d)−s≠0’.
According to an embodiment, to select the best mask for the kernels in each layer, the present system and method may select a mask that results in the largest root mean square value of the kernel after puncturing. The present system and method may thereby effectively select the puncturing positions of the kernel with the smallest weight values while constraining the puncturing pattern to be regular in all kernel dimensions.
After puncturing the kernels, to recover the original accuracy, the network may be fine-tuned by training examples to minimize the error cost function while imposing the already chosen regular puncturing patterns for each kernel. Hence, fine tuning may change only the weight values in the kernel that have not been masked by the regular pattern and leave the punctured positions to be zero.
According to an embodiment, the present system and method regularly punctures both feature maps and kernels. Hence, during batch convolutional processing, the batch operation may only be performed if both the feature-map and the kernel operands are non-zero, i.e., if [mod(x+y, d_kernel)−s≠0] AND [T[w+x][h+y]≠0], which may result in a compound reduction in the computational complexity. Retraining the network with fine-tuning of both kernel puncturing and feature map puncturing improves the accuracy of the punctured network, according to an embodiment.
According to an embodiment, the present system and method reduces the computational complexity of a CNN, including performing one or more of puncturing (setting to zero) a feature map and puncturing a kernel, where puncturing the feature map and puncturing the kernel includes setting one or more elements of a row to zero according to a predetermined pattern and cyclic shifting the row at a predetermined interval. The present system and method further implements a masking layer in the CNN to allow fine-tuning with punctured feature maps and recover the accuracy.
According to an embodiment, a convolutional neural network (CNN) system may be implemented by one or more computers for generating a classification for an input image received by the CNN system. The CNN system may include a sequence of neural network layers, such as those shown in
The sequence of neural network layers and/or other elements of the CNN system may be configured to perform the operations outlined in the flow chart of
The process of 702 may be explained in the context of the punctured feature map 301 of
According to an embodiment, to puncture the at least one selection among the feature map and the kernel includes setting the value of an element at index (w, h) of the at least one selection to zero when:
((w+h) modulo d)−s=0,
where s is an integer shift value; d is an integer puncture interval value greater than 1; and w and h are index values of the element.
According to an embodiment, to puncture the at least one selection among the feature map and the kernel includes performing dot product of the at least one selection with a masking matrix, where the value of an element at index (w, h) of the masking matrix is zero when:
((w+h) modulo d)−s=0, and
where the value of the element at index (w, h) of the masking matrix is one when:
((w+h) modulo d)−s≠0,
where s is an integer shift value; d is an integer puncture interval value greater than 1; and w and h are index values of the element.
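By way of example, a masking matrix satisfying these conditions may be generated as in the sketch below, where applying it as an elementwise (Hadamard) product preserves the size of the feature map or kernel; the function name is an illustrative assumption.

```python
import numpy as np

def make_mask(W, H, d, s=0):
    """Masking matrix: zero where ((w + h) mod d) - s == 0, one elsewhere."""
    T = np.ones((W, H))
    for w in range(W):
        for h in range(H):
            if ((w + h) % d) - s == 0:
                T[w, h] = 0.0
    return T

# Puncturing is then the elementwise product with the mask, which preserves the
# feature-map (or kernel) dimensions while zeroing the punctured positions:
# X_punctured = X * make_mask(W, H, d, s)
```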
According to an embodiment, the CNN system may be further configured to fine-tune the sequence of neural network layers by: convolving the feature map, unpunctured, with the kernel, unpunctured, to generate a second convolved output, and evaluating the accuracy of the first convolved output according to an error cost function that compares the first convolved output and the second convolved output.
According to an embodiment, the CNN system may be further configured to fine-tune the sequence of neural network layers by puncturing the at least one selection according to variations of the shift value s to determine an optimal shift value that minimizes the error cost function.
According to an embodiment, the CNN system may be further configured to fine-tune the sequence of neural network layers by puncturing the at least one selection according to variations of the puncture interval value d to determine an optimal puncture value that minimizes the error cost function.
According to an embodiment, the CNN system may be further configured to fine-tune the sequence of neural network layers by performing backpropagation to adjust the values of the elements of the kernel by calculating gradients of the error cost function with respect to the elements of the kernel and minimizing the error cost function. If the at least one selection includes the kernel, performing backpropagation may adjust only the values of the elements of the kernel that are not set to zero by the puncturing.
According to an embodiment, if the at least one selection includes both the feature map and the kernel (i.e., both the feature map and kernel are punctured), the sequence of neural network layers may be configured to puncture the feature map and the kernel, respectively, using different puncturing interval values.
According to an embodiment, the sequence of neural network layers may be further configured to: calculate a maximum value and a maximum position corresponding to the maximum value using the at least one selection, and calculate a value at the maximum position using an unpunctured form of the at least one selection.
According to an embodiment, the cyclic shifted pattern may shift in a third dimension such that to puncture the at least one selection among the feature map and a kernel includes setting the value of an element at index (w, h, c) of the at least one selection to zero when:
((w+h+c) modulo d)−s=0,
where s is an integer shift value; d is an integer puncture interval value greater than 1; and w, h, and c are index values of the element.
Various embodiments of the present system and method may be implemented using hardware elements, software elements, or a combination of both. Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
One or more aspects of at least an embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores” may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
While certain features set forth herein have been described with reference to various implementations, this description is not intended to be construed in a limiting sense. Hence, various modifications of the implementations described herein, as well as other implementations, which are apparent to persons skilled in the art to which the present disclosure pertains are deemed to lie within the scope of the present disclosure.
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the scope of the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen in order to best explain the principles underlying the claims and their practical applications, to thereby enable others skilled in the art to best use the embodiments with various modifications as are suited to the particular uses contemplated.
This application claims priority to and the benefit of U.S. patent application Ser. No. 15/634,537 filed on Jun. 27, 2017, which in turn claims the benefit of U.S. Provisional Patent Application No. 62/486,626 filed on Apr. 18, 2017, and incorporates by reference herein the entire contents of the above-mentioned applications.