The 1-bit convolutional neural network (1-bit CNN, also known as a binary neural network), wherein both weights and activations are binary, is one of the most promising neural network compression methods for deploying models onto resource-limited devices. It enjoys a 32× memory compression ratio and up to a 58× practical computational reduction. Moreover, with its pure logical computation (i.e., XNOR operations between binary weights and binary activations), the 1-bit CNN is both highly energy-efficient for embedded devices and has the potential to be deployed directly on next-generation memristor-based hardware.
Despite these attractive characteristics, severe accuracy degradation prevents the 1-bit CNN from being broadly deployed. For example, XNOR-Net, a representative binary network, achieves only 51.2% accuracy on the ImageNet classification dataset, leaving an 18% accuracy gap from the real-valued ResNet-18. Some preeminent binary networks show good performance on small datasets such as CIFAR10 and MNIST, but still encounter a severe accuracy drop when applied to a large dataset such as ImageNet. Therefore, it is desirable to provide a design for a 1-bit CNN that improves its accuracy while preserving the characteristics that make it attractive for deployment on resource-limited devices.
Disclosed herein is a design for a 1-bit CNN that closes the performance gap between binary neural networks and real-valued networks on challenging large-scale datasets. The invention starts with a design for a high-performance baseline network. In one embodiment, MobileNetV1 is chosen as the binarization backbone, although in other embodiments, any binary backbone may be used. Next, the invention adopts blocks with identity shortcuts which bypass 1-bit generic convolutions to replace the convolutions in MobileNetV1. Moreover, the invention uses a concatenation of two such blocks to handle the channel number mismatch in the downsampling layers, as shown in the accompanying drawings.
To further enhance the accuracy, the invention introduces activation distribution reshaping and shifting via non-linearity function design. The overall activation value distribution affects the feature representation, and this effect is exaggerated by the activation binarization. A small shift of the distribution near zero can cause the binarized feature map to have a markedly different appearance, which influences the final accuracy. This is achieved by a new generalization of the Sign and PReLU functions which explicitly shifts and reshapes the activation distribution, referred to herein as ReAct-Sign (RSign) and ReAct-PReLU (RPReLU) respectively. These novel activation functions adaptively learn the parameters for distributional reshaping, which enhances the accuracy of the baseline network with negligible extra computational cost.
Furthermore, the invention introduces a distributional loss to enforce the output distribution similarity between the binary and real-valued networks, which further boosts the accuracy.
The novel aspects of the invention can be summarized as follows: (1) a baseline binary network is provided as a modification of MobileNetV1; (2) a channel-wise reshaping and shifting operation on the activation distribution is introduced, which frees the binary convolutions from spending capacity on adjusting the distribution and allows them to learn more representative features; and (3) a distributional loss between the binary and real-valued network outputs replaces the original loss, which allows the binary network to mimic the output distribution of a real-valued network.
By way of example, a specific exemplary embodiment of the disclosed system and method will now be described, with reference to the accompanying drawings, in which:
In a 1-bit convolutional layer, both weights and activations are binarized to −1 and +1, such that the computationally heavy floating-point matrix multiplication can be replaced by lightweight bitwise XNOR and popcount operations, as:
X_b ∗ W_b = popcount(XNOR(X_b, W_b))  (1)

where X_b denotes the binarized activations, W_b denotes the binarized weights, and ∗ denotes the convolution operation.
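By way of illustration only, the following Python snippet (which is not part of the claimed invention) checks the bitwise identity underlying Equation (1): when ±1 values are packed into bits, a single XNOR followed by a popcount determines the dot product, since the dot product of two ±1 vectors equals 2·popcount(XNOR) − n, with the constant factor and offset typically folded into subsequent layers.

```python
import random

n = 64
a = [random.choice([-1, 1]) for _ in range(n)]   # binary activations
w = [random.choice([-1, 1]) for _ in range(n)]   # binary weights

# Bit-encode: +1 -> 1, -1 -> 0
a_bits = sum((1 << i) for i, v in enumerate(a) if v == 1)
w_bits = sum((1 << i) for i, v in enumerate(w) if v == 1)

mask = (1 << n) - 1
xnor = ~(a_bits ^ w_bits) & mask           # 1 wherever the two signs agree
popcount = bin(xnor).count("1")

dot = sum(x * y for x, y in zip(a, w))     # conventional multiply-accumulate
# The +/-1 dot product equals 2*popcount - n, so one XNOR plus one popcount
# replaces n multiply-accumulate operations.
assert dot == 2 * popcount - n
print(dot, popcount)
```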
Specifically, weights and activations are binarized through a Sign function:
where:
Note that with the introduction of the novel ReAct operations, discussed below, this scaling factor for activations becomes unnecessary and can be eliminated.
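By way of example only, the binarization described above may be sketched in PyTorch as follows, in which activations are binarized by a Sign function with a straight-through estimator for backpropagation and the weights are binarized and scaled by their per-output-channel mean absolute value; the module names, the straight-through estimator, and the placement of the scaling factor on the weights are illustrative assumptions rather than requirements of the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryActivation(nn.Module):
    """Binarizes activations to -1/+1; a straight-through estimator (STE)
    passes gradients through the non-differentiable Sign function."""
    def forward(self, x):
        binary = torch.where(x > 0, torch.ones_like(x), -torch.ones_like(x))
        clipped = x.clamp(-1.0, 1.0)
        # Forward value is the binary tensor; backward follows the clipped input.
        return (binary - clipped).detach() + clipped

class BinaryConv2d(nn.Conv2d):
    """1-bit convolution: weights binarized to -1/+1 and scaled by the
    per-output-channel mean absolute value of the latent real-valued weights."""
    def forward(self, x):
        w = self.weight
        scale = w.abs().mean(dim=(1, 2, 3), keepdim=True)
        w_bin = torch.where(w > 0, torch.ones_like(w), -torch.ones_like(w))
        # STE: forward uses the scaled binary weights, gradients reach w.
        w_q = (scale * w_bin - w).detach() + w
        return F.conv2d(x, w_q, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

During training, the latent real-valued weights are retained and updated; at inference only the binary weights and activations are needed, permitting the XNOR/popcount substitution of Equation (1).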
Baseline Network—In a primary embodiment, the MobileNetV1 structure is chosen for constructing the baseline binary network. A shortcut is added to bypass every 1-bit convolutional layer that has the same number of input and output channels. The 3×3 depth-wise and the 1×1 point-wise convolutional blocks in MobileNetV1 are replaced by 3×3 and 1×1 generic convolutions in parallel with shortcuts, respectively, as shown in the accompanying drawings.
Additionally, a new structure design to handle the downsampling layers is provided. For the downsampling layers, whose input and output feature map sizes differ, prior art works adopt real-valued convolutional layers to match their dimensions and to ensure that the real-valued feature map propagating along the shortcut is not “cut off” by the activation binarization. However, this strategy increases the computational cost. Instead, the present invention ensures that all convolutional layers have the same input and output dimensions so that they can be safely binarized, and uses a simple identity shortcut for activation propagation without additional real-valued matrix multiplications.
As shown in the accompanying drawings, each downsampling layer concatenates the outputs of two such blocks applied to the same input, doubling the number of output channels to handle the channel number mismatch without introducing real-valued convolutions.
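Continuing the example, the normal and downsampling blocks described above may be sketched as follows, reusing the illustrative BinaryActivation and BinaryConv2d modules from the preceding sketch; the parameter-free average pooling used to reduce the spatial size of the shortcut path and the batch normalization placement are example choices rather than requirements of the invention.

```python
import torch
import torch.nn as nn

class BasicBinaryBlock(nn.Module):
    """A 1-bit convolution bypassed by an identity shortcut (equal input
    and output channel counts), as in the baseline network."""
    def __init__(self, channels, kernel_size, stride=1):
        super().__init__()
        self.binarize = BinaryActivation()          # from the preceding sketch
        self.conv = BinaryConv2d(channels, channels, kernel_size, stride=stride,
                                 padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        # Parameter-free shortcut; pooling only when the spatial size shrinks.
        self.shortcut = nn.AvgPool2d(2) if stride == 2 else nn.Identity()

    def forward(self, x):
        return self.bn(self.conv(self.binarize(x))) + self.shortcut(x)

class DownsamplingBlock(nn.Module):
    """Two identical blocks applied to the same input and concatenated,
    doubling the channel count without real-valued convolutions."""
    def __init__(self, channels, kernel_size, stride=2):
        super().__init__()
        self.branch_a = BasicBinaryBlock(channels, kernel_size, stride)
        self.branch_b = BasicBinaryBlock(channels, kernel_size, stride)

    def forward(self, x):
        return torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)
```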
ReActNet—The intrinsic property of an image classification neural network is to learn a mapping from input images to output logits. A logical deduction is that a well-performing binary neural network should learn a logits distribution similar to that of a real-valued network. However, the discrete values of the variables limit binary neural networks from learning distributional representations as rich as those of real-valued networks. To address this, XNOR-Net calculates analytical real-valued scaling factors and multiplies them with the activations; these factors may also be learned through back-propagation.
In contrast to these previous works, the present invention focuses on a different aspect: the activation distribution. Small variations to activation distributions can greatly affect the semantic feature representations in 1-bit CNNs, which, in turn, will influence the final performance. However, 1-bit CNNs have limited capacity to learn appropriate activation distributions. To address this dilemma, generalized activation functions are introduced with learnable coefficients to increase the flexibility of 1-bit CNNs for learning semantically-optimized distributions.
For 1-bit CNNs, learning the activation distribution is both crucial and difficult. Because the activations in a binary convolution can only take values from {−1, +1}, a small distributional shift in the input real-valued feature map before the Sign function can result in completely different output binary activations, which directly affects the informativeness of the features and significantly impacts the final accuracy. For illustration, the output binary feature maps of real-valued inputs are plotted in the accompanying drawings for the original distribution and for slightly shifted versions of it.
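The sensitivity described above can be illustrated with a simple numerical check (illustrative figures only): applying a small constant shift to a zero-centered feature map flips a noticeable fraction of the binarized activations.

```python
import torch

torch.manual_seed(0)
feature = torch.randn(1, 64, 14, 14)   # zero-centered real-valued feature map
shifted = feature + 0.1                # small distributional shift before the Sign

flipped = (torch.sign(feature) != torch.sign(shifted)).float().mean().item()
print(f"fraction of binary activations flipped by a 0.1 shift: {flipped:.2%}")
# For a standard normal distribution, roughly 4% of the entries change sign,
# even though the shift is small relative to the activation range.
```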
Based on the aforementioned observation, disclosed herein is an effective operation to explicitly reshape and shift the activation distributions, referred to herein as “ReAct”, which generalizes the traditional Sign and PReLU activation functions to ReAct-Sign (“RSign”) and ReAct-PReLU (“RPReLU”) respectively.
Essentially, RSign is defined as a Sign function with channel-wise learnable thresholds:

x_i^b = h(x_i^r) = +1, if x_i^r > α_i; −1, if x_i^r ≤ α_i

where x_i^r is a real-valued input on the ith channel, x_i^b is its binarized output, h denotes the RSign function, and α_i is the learnable threshold for the ith channel.
Similarly, RPReLU is defined as:

f(x_i) = (x_i − γ_i) + ζ_i, if x_i > γ_i; β_i(x_i − γ_i) + ζ_i, if x_i ≤ γ_i

where x_i is a real-valued input on the ith channel, γ_i and ζ_i are learnable shifts for moving the input and output distributions respectively, and β_i is a learnable coefficient controlling the slope of the negative part.
All of the coefficients can be different across channels.
Intrinsically, RSign learns the best channel-wise threshold (α) for binarizing the input feature map or, equivalently, shifts the input distribution to obtain the best distribution for taking a sign. From the latter angle, RPReLU can be interpreted as follows: γ shifts the input distribution to find the best point at which β “folds” the distribution, and ζ then shifts the output distribution, as illustrated in the accompanying drawings.
A ReActNet block is shown in the accompanying drawings.
The number of extra parameters introduced by RSign and RPReLU is only the number of channels in the network, which is negligible considering the large size of the weight matrices. The computational overhead approximates that of a typical non-linear activation layer, which is also trivial compared to the computationally intensive convolutional operations.
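By way of example only, RSign and RPReLU may be realized as the following PyTorch modules, with α, γ, β and ζ as channel-wise learnable parameters as defined above; the straight-through estimator used for the RSign gradient and the initial parameter values are illustrative choices.

```python
import torch
import torch.nn as nn

class RSign(nn.Module):
    """Sign function with a learnable channel-wise threshold alpha."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        shifted = x - self.alpha
        binary = torch.where(shifted > 0, torch.ones_like(shifted),
                             -torch.ones_like(shifted))
        clipped = shifted.clamp(-1.0, 1.0)
        # Straight-through estimator: forward is binary, gradients reach x and alpha.
        return (binary - clipped).detach() + clipped

class RPReLU(nn.Module):
    """PReLU generalized with learnable channel-wise shifts gamma (input) and
    zeta (output), and a learnable slope beta for the negative part."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.zeta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.full((1, channels, 1, 1), 0.25))

    def forward(self, x):
        shifted = x - self.gamma
        return torch.where(shifted > 0, shifted, self.beta * shifted) + self.zeta
```

For instance, an RSign may replace the plain Sign before each 1-bit convolution, with an RPReLU following the convolution, consistent with the one-parameter-per-channel overhead noted above.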
Optimization—Parameters in RSign and RPReLU can be optimized end-to-end with the other parameters in the network. The gradient of α_i in RSign can be derived simply by the chain rule as:

∂L/∂α_i = Σ_{x_i} (∂L/∂h(x_i)) · (∂h(x_i)/∂α_i)

where L denotes the overall loss function and h denotes the RSign function.
The summation is applied to all entries in the ith channel. The derivative ∂h(x_i)/∂α_i can be easily computed as:

∂h(x_i)/∂α_i = −1
Similarly, for each parameter in RPReLU, the local gradients follow directly from its definition:

∂f(x_i)/∂γ_i = −I{x_i ≤ γ_i}·β_i − I{x_i > γ_i}
∂f(x_i)/∂β_i = I{x_i ≤ γ_i}·(x_i − γ_i)
∂f(x_i)/∂ζ_i = 1

and, as with α_i, the gradient of the loss with respect to each parameter is obtained by the chain rule, summing over all entries in the ith channel. Here, I denotes the indicator function: I{⋅} = 1 when the inequality inside the braces holds, and I{⋅} = 0 otherwise.
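Consistent with the end-to-end optimization described above, the learnable coefficients receive gradients automatically under standard automatic differentiation; the following self-contained check (using an illustrative functional form of RPReLU and toy tensor shapes) confirms that γ, β and ζ each obtain one gradient value per channel.

```python
import torch

# Minimal functional RPReLU (per-channel parameters broadcast over N,C,H,W),
# used here only to confirm that gamma, beta and zeta receive gradients.
def rprelu(x, gamma, beta, zeta):
    shifted = x - gamma
    return torch.where(shifted > 0, shifted, beta * shifted) + zeta

channels = 8
gamma = torch.zeros(1, channels, 1, 1, requires_grad=True)
zeta = torch.zeros(1, channels, 1, 1, requires_grad=True)
beta = torch.full((1, channels, 1, 1), 0.25, requires_grad=True)

x = torch.randn(2, channels, 4, 4)
loss = rprelu(x, gamma, beta, zeta).sum()
loss.backward()

# Each parameter receives one gradient per channel; for this toy loss the
# zeta gradient equals the number of entries per channel, matching df/dzeta = 1.
print(gamma.grad.view(-1), beta.grad.view(-1), zeta.grad.view(-1))
```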
Distributional Loss—Based on the insight that if a binary neural network can learn output distributions similar to those of a real-valued network, its performance can be enhanced, a distributional loss is formulated to enforce this similarity between the outputs of the binary network and those of a real-valued reference network:
where:
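One possible, non-limiting form of such a distributional loss is a Kullback-Leibler divergence between the softmax outputs of a pretrained real-valued reference network and those of the binary network, as sketched below; the choice of KL divergence and the function names are assumptions made here for illustration.

```python
import torch
import torch.nn.functional as F

def distributional_loss(binary_logits, real_logits):
    """One possible distributional loss: KL divergence between the output
    distribution of the real-valued reference network (treated as the target)
    and that of the binary network."""
    log_p_binary = F.log_softmax(binary_logits, dim=1)
    p_real = F.softmax(real_logits, dim=1).detach()   # reference network is fixed
    return F.kl_div(log_p_binary, p_real, reduction="batchmean")

# Usage sketch: the distributional loss replaces the original cross-entropy loss.
# binary_logits = binary_network(images)
# with torch.no_grad():
#     real_logits = real_valued_network(images)
# loss = distributional_loss(binary_logits, real_logits)
```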
Herein were disclosed several novel ideas for optimizing a 1-bit CNN for higher accuracy. First, parameter-free shortcuts were designed, based on MobileNetV1, to propagate real-valued feature maps through both the normal convolutional layers and the downsampling layers. Then, based on the observation that the performance of 1-bit CNNs is highly sensitive to distributional variations, ReAct-Sign and ReAct-PReLU were introduced to shift and reshape the distributions in a learnable fashion, yielding dramatic enhancements in top-1 accuracy. Additionally, the invention incorporates a distributional loss, defined between the outputs of the binary network and the real-valued reference network, to replace the original cross-entropy loss for training.
As would be realized by one of skill in the art, the methods described herein can be implemented by a system comprising a processor and memory, storing software that, when executed by the processor, performs the functions comprising the method.
This application is a national phase filing under 35 U.S.C. § 371 claiming the benefit of and priority to International Patent Application No. PCT/US22/13680, filed Jan. 25, 2022, entitled “Binary Neural Networks with Generalized Activation Functions,” which claims the benefit of U.S. Provisional Patent Application No. 63/146,758, filed Feb. 8, 2021, the contents of which are incorporated herein in their entireties.