Neural Architecture Search (NAS) has emerged as a state-of-the-art method that exploits Artificial Intelligence (AI) to automatically design deep neural networks. The method involves searching among a large number of candidate architectures to find one that provides the desired combination of accuracy and efficiency. Early search methods required large amounts of computation. However, approaches such as weight-sharing NAS and Differentiable NAS can greatly reduce the computation time.
Weight-sharing NAS first designs a large network called a super-net that contains many possible sub-networks. The problem is to find an appropriate sub-network that provides high accuracy for a chosen task. In weight-sharing NAS, different sub-networks share the same weights, and all sub-networks are trained jointly. This removes the main limitation of earlier NAS methods, which typically sampled individual sub-networks and trained them independently in parallel over many processors.
Differentiable NAS (DNAS) is a different class of NAS techniques. In DNAS, a super-net containing all possible sub-networks is trained jointly with architecture parameters (α-parameters). The super-net assembles all candidate architectures into a weight-sharing network, with each architecture option corresponding to one sub-network. Because the sub-networks are trained simultaneously with the super-net, different architectures can directly inherit their weights from the super-net for evaluation and deployment. This approach eliminates the extremely large cost of training or fine-tuning each architecture option individually.
The architecture parameters (α-parameters) in DNAS represent the importance, or probability, of different decision choices at various locations inside a super-net. Specifically, training a regular deep network involves updating weight parameters using an optimization algorithm, such as stochastic gradient descent (SGD). DNAS, however, updates not only the actual weights (of operations such as two-dimensional convolutions), but also the architecture parameters. Hence, weights and architecture parameters are trained jointly. At the end of training, the operation corresponding to the maximum architecture-parameter value is chosen at each decision point. The process concludes with a final training stage for the architecture determined by the maximum architecture parameters.
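By way of illustration only, the following Python sketch (using PyTorch) shows one way in which weights and architecture parameters might be updated jointly at a single decision point. The candidate operation set, the soft-max relaxation, the stand-in objective, and all hyper-parameter values are assumptions made for this example and are not limiting.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    # One decision point in a super-net: a softmax-weighted sum of
    # candidate operations (an illustrative set, not the disclosed one).
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # 3x3 convolution
            nn.Conv2d(channels, channels, 5, padding=2),  # 5x5 convolution
            nn.AvgPool2d(3, stride=1, padding=1),         # average pooling
        ])
        # One architecture parameter (alpha) per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Weights and architecture parameters are trained jointly: a single
# optimizer step updates both W (the conv kernels) and alpha.
block = MixedOp(channels=16)
optimizer = torch.optim.SGD(block.parameters(), lr=0.01, momentum=0.9)
x = torch.randn(8, 16, 32, 32)
target = torch.randn(8, 16, 32, 32)        # stand-in training objective
loss = F.mse_loss(block(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After training, the operation with the largest alpha is selected.
chosen = block.ops[int(block.alpha.argmax())]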
A disadvantage of weight-sharing NAS, whether based on sub-network sampling or on gradient descent, is that the performance of a sub-network trained as part of the super-net may be very different from the performance of the same sub-network when trained individually. In addition, in DNAS, the learning process may become unstable.
The accompanying drawings provide visual representations which will be used to describe various representative embodiments more fully and can be used by those skilled in the art to better understand the representative embodiments disclosed and their inherent advantages. In these drawings, like reference numerals identify corresponding or analogous elements.
The various apparatus and devices described herein provide mechanisms for automated design of neural network architecture or topology. In particular, mechanisms are disclosed for automated selection of a sub-network from a super-net containing multiple candidate sub-networks.
While this present disclosure is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the embodiments shown and described herein should be considered as providing examples of the principles of the present disclosure and are not intended to limit the present disclosure to the specific embodiments shown and described. In the description below, like reference numerals are used to describe the same, similar or corresponding parts in the several views of the drawings. For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
The present disclosure relates to a mechanism to control the stability and performance of weight-sharing methods for designing neural networks. By way of example, a weight-sharing differentiable neural architecture search (DNAS) design method is described where weights are shared across multiple sub-networks during the training of a super-net. The approach recognizes that sharpness/smoothness of network layers is a fundamental property that can determine the stability/convergence of weight-sharing NAS. Previously, smoothness measures have only been defined for differences in the functions represented by a given network with respect to training and test datasets (due to differences in input data distributions between training and test sets). In contrast, the disclosed approach uses a measure of smoothness, or lack thereof, defined for the loss landscapes of various sub-networks in the architecture space of a super-net. The approach recognizes that different sub-networks have different distributions of architecture-parameters (α-parameters) and weight-parameters (W-parameters).
As discussed in greater detail below, a conventional weight-sharing NAS problem is augmented with an additional loss function that specifically pushes the individual sub-networks to become smoother. This new loss function approximates the Lipschitz constant of the neural network, and the loss is minimized jointly for all sub-networks sampled for a batch of data.
Experimental results are presented that demonstrate the advantages of the proposed approach and clearly show that the sharpness metric (the approximate Lipschitz constant) is noticeably reduced while the accuracy improves. Results also show that the disclosed approach produces networks with higher accuracy at a similar hardware cost.
Weight-sharing NAS has become very important for hardware-aware NAS, an automated technique that builds very efficient deep networks for a given hardware platform. However, instability in weight-sharing NAS is a critical problem. Beyond NAS, weight-sharing also arises in other well-known and challenging problems, such as multi-task learning and multi-modal learning, where it presents similar difficulties.
Data processor 100 may be implemented, for example, on a general-purpose processor, graphics processing unit, vector processor or array processor. The super-net may be implemented using custom hardware or a combination of custom and general-purpose hardware.
The present disclosure relates to improved mechanisms for automated selection of a sub-network, from super-net 102, for a chosen task or application. Training data 106 is provided for the chosen task. The training data includes a set of training inputs and corresponding training outputs. During training, a data loader 108 is configured to supply training inputs 110 to super-net 102 to produce outputs 112. For example, output 112 may be a label classifying the training input 110.
Outputs 112 are passed to supervised learning controller 114. Output 112 is compared to corresponding desired training output 116 in supervised learning controller 114. Network weights, W, of network 102 are adjusted by an amount δW (118), to reduce a cost function computed from a difference between training output 116 and network output 112.
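The following minimal sketch illustrates one possible form of this supervised update, assuming a PyTorch model and optimizer; the helper name and the cross-entropy cost are illustrative assumptions, with `model` standing in for super-net 102.

import torch.nn.functional as F

def training_step(model, optimizer, training_input, training_output):
    # Compare the network output to the desired training output and
    # adjust the weights by a step that reduces the cost.
    network_output = model(training_input)      # outputs 112
    cost = F.cross_entropy(network_output, training_output)
    optimizer.zero_grad()
    cost.backward()                 # gradients give the direction of deltaW (118)
    optimizer.step()                # apply the weight adjustment
    return cost.item()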
One approach for Neural Architecture Search (NAS) is weight-sharing NAS, discussed above. In this approach, the performances of different combinations of sub-networks are compared to select a final sub-network.
Another approach is differentiable NAS (DNAS). DNAS also uses a super-net containing all possible sub-networks. However, the search space is relaxed to be continuous. This enables the architecture to be optimized by gradient descent. A super-net containing all possible sub-networks is trained jointly with architecture parameters (α-parameters). The super-net includes paths with selectable operations such as convolution, max-pooling, average pooling, etc. The architecture parameters represent the importance, or probability, of different architecture choices at various locations inside a super-net. Training a regular deep network involves updating weight parameters using an optimization algorithm, such as stochastic gradient descent (SGD). DNAS not only updates the actual weights of operations, but also the architecture parameters, so the two are trained jointly.
Both gradient-based training, such as DNAS, and sample-based NAS, make use of a measure of performance referred to as a “loss function,” or simply a “loss.” The present disclosure provides a loss function that enables selection of a final network architecture having improved performance.
One embodiment of the disclosure is a data processor that includes a super-net including a plurality of selectable sub-networks, the super-net including network weights and architecture parameters, a data loader configured to access one or more batches of training data for a designated task, and a supervised learning controller configured to train network weights and architectural parameters of the super-net. The training is accomplished by providing training inputs of the training data to sample sub-networks of the super-net to generate sub-network outputs, accumulating a loss over the sample sub-networks, the accumulated loss based, at least in part, on a sum, over layers of a sub-network, of measures of smoothness based on network weights in the layers, and adjusting network weights and architectural parameters of the super-net to reduce the accumulated loss. The supervised learning controller is also configured to select a sub-network of the plurality of sub-networks dependent upon the adjusted architectural parameters and output a description of the selected sub-network. As described below, the accumulated loss may combine a first loss and a second loss, where the first loss is based on a difference between an output of a sample sub-network, generated from a training input, and a corresponding training output, and the second loss is a product of a smoothness scale factor and the sum, over layers of the sample sub-network, of measures of smoothness based on network weights in the layers.
Embodiments of the present disclosure may be implemented in a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to select a neural network architecture, as described below.
CNN 406 and smaller CNNs 408, 410 and 412 are depicted as being combined with weighting values fᵢ(α), where α is a vector of architecture parameters. These parameters may represent relaxed selections or probabilities, for example. In one embodiment, the weighting value is a normalized non-negative linear weight, fᵢ(α) = αᵢ/Σₙαₙ. In a further embodiment, the weighting value is a soft-max value, fᵢ(α) = exp(αᵢ)/Σₙexp(αₙ). Other weighting values may be used without departing from the present disclosure. The final architecture is obtained by setting one α to unity and the others to zero.
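For illustration, both weighting functions may be written as short routines; the sketch below assumes PyTorch tensors and a non-negativity clamp for the linear form, and is not the only possible formulation.

import torch

def linear_weights(alpha):
    # Normalized non-negative linear weights: f_i = alpha_i / sum_n alpha_n.
    alpha = alpha.clamp(min=0)
    return alpha / alpha.sum()

def softmax_weights(alpha):
    # Soft-max weights: f_i = exp(alpha_i) / sum_n exp(alpha_n).
    return torch.softmax(alpha, dim=0)

alpha = torch.tensor([0.2, 1.5, 0.3])
print(linear_weights(alpha))   # tensor([0.1000, 0.7500, 0.1500])
print(softmax_weights(alpha))  # the largest alpha receives the largest weight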
A block 400 may be configured to select a kernel size and number of channels for a layer. This may be achieved using a super-kernel, which results in a very efficient NAS method. The concept of super-kernel is depicted in
Various NAS techniques, including sampling-based searches and gradient-based searches, use weight-sharing. However, while weight-sharing produces highly efficient NAS solutions, there is a major problem associated with this technique: namely, since the same set of weights forms part of different sub-networks, weight-sharing can lead to severe instability in gradients when training the super-nets. For instance, suppose that, for a given layer, 128 channels are sampled for a first training batch and 256 channels are sampled at the same layer for the next training batch. In this case, since the first 128 channels are common to both sub-network samples, they are trained as part of different functions for two different training batches of data, as illustrated in the sketch below. The present disclosure recognizes that training the same set of weights with different functions can result in conflicting gradient updates. For example, the gradients for two sub-networks could be orthogonal, or in opposite directions. As a result, weight-sharing NAS can suffer significant instability in the training process due to uncorrelated gradient updates. This, in turn, can result in poor convergence during super-net training, or even convergence to suboptimal solutions.
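The sketch below is a hedged illustration of the source of this conflict: one shared weight tensor is sliced to realize sub-networks with different channel counts, so the leading filters receive gradient updates from two different functions. The tensor shapes and the helper name are assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

# One shared kernel with a maximum of 256 output channels; smaller
# sub-networks reuse its leading filters.
shared_weight = nn.Parameter(torch.randn(256, 3, 3, 3))

def conv_subnet(x, out_channels):
    # Slice the shared kernel to realize a sub-network with the
    # sampled number of output channels.
    return F.conv2d(x, shared_weight[:out_channels], padding=1)

x = torch.randn(4, 3, 32, 32)
y_128 = conv_subnet(x, 128)  # first training batch: 128-channel sub-network
y_256 = conv_subnet(x, 256)  # next training batch: 256-channel sub-network
# The gradients received by shared_weight[:128] from the two batches come
# from two different functions and may be uncorrelated, or even opposed.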
In accordance with the present disclosure, a mechanism is provided to improve the stability of weight-sharing techniques. A smoothness-based loss term is added for evaluating performance while training a super-net. The loss term penalizes sub-networks with lower smoothness. As a result, the optimization landscape of various sub-networks in the given super-net becomes smoother (or flatter), resulting in lower discrepancy between the loss functions of different sub-networks. The gradients of various sub-networks automatically become more stable, leading to significantly better training convergence and accuracy in the search. In one embodiment, the smoothness-based loss term is related to Lipschitz continuity, which is a measure of a function's flatness (a lower Lipschitz constant indicates higher smoothness or flatness).
The provision of a smoothness-based loss term addresses the problem of instability in weight-sharing DNAS by making the optimization landscape of sub-networks smoother. Consequently, there is lower discrepancy among the gradients of different sub-networks. This, in turn, results in stable gradients, better convergence, and higher accuracy.
When training a conventional deep neural network, there is a distribution shift, since the training and testing datasets belong to slightly different distributions. The present disclosure recognizes that, in a weight-sharing NAS scenario, the super-net and all sub-networks represent a distribution shift in the neural network architecture space. That is, the architectures and their weights belong to slightly different distributions in the space of all possible sub-networks. A new notion of sharpness is defined in the space of neural network architectures. This provides a metric for the stability of weight-sharing NAS.
The present disclosure introduces an additional loss function by which the search process is evaluated. The loss function penalizes networks that have steeper loss landscapes in favor of smoother landscapes. In one embodiment, the measure of smoothness is based on the Lipschitz constant. A function ƒ: Rⁿ→Rᵐ that maps length-n real vectors to length-m real vectors is called Lipschitz continuous if there exists a constant C such that:
∀ x, y ∈ Rⁿ, ∥ƒ(x)−ƒ(y)∥₂ ≤ C∥x−y∥₂,
where x, y are inputs, ƒ(x), ƒ(y) are the corresponding outputs, and ∥·∥₂ denotes the L2 vector norm. A lower value of the Lipschitz constant C indicates a smoother, or flatter, function ƒ. The mechanism disclosed here biases the architecture search to minimize the Lipschitz constants of all sub-networks in a weight-sharing NAS.
Neural networks are generally non-convex functions, so computing their exact Lipschitz constant is not practical. Consequently, the approach uses approximations to the Lipschitz constant for each layer in a given sub-network. Given a linear layer with weight matrix W, the L2-norm-based Lipschitz constant is given by the maximum singular value σmax of the matrix. However, calculating a singular value decomposition of a large matrix is computationally very expensive.
In one embodiment, the maximum singular value for a square matrix W is estimated as:
For a non-square matrix W, the maximum singular value is estimated as:
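As one standard way to approximate σmax without computing a full singular value decomposition, the following sketch uses power iteration and handles square and non-square matrices alike. This particular estimator is an illustrative assumption and is not necessarily the estimate referred to above.

import torch

def sigma_max_power_iteration(W, n_iters=5):
    # Estimate the maximum singular value of a weight matrix by power
    # iteration, avoiding a full SVD. Convolution kernels are flattened
    # to two dimensions first. This estimator is an assumption made for
    # the example.
    W = W.reshape(W.shape[0], -1)
    v = torch.randn(W.shape[1])
    v = v / v.norm()
    for _ in range(n_iters):
        u = W @ v
        u = u / (u.norm() + 1e-12)
        v = W.t() @ u
        v = v / (v.norm() + 1e-12)
    return (W @ v).norm()            # approximately sigma_max(W)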
The maximum singular value σmax is a measure of the lack of smoothness, in that a weight matrix with a smaller maximum singular value is smoother than one with a larger value. The Lipschitz loss term for each sub-network is added to the conventional output-error loss, such as a cross-entropy loss, LossCE. In this way, the approximate Lipschitz constant of each sub-network is minimized during the search process of weight-sharing NAS. The final loss function is:

Loss = LossCE + λ·Σᵢ σmax,i,
where the first loss, LossCE, is a cross-entropy loss for a given training input and sub-network, and the second loss, λ·Σᵢ σmax,i, is the sum, over the layers i of the sub-network, of estimates of the maximum singular value of each layer's weight matrix. The factor λ is a smoothness scale factor that determines the relative importance of the first and second loss terms.
Thus, the final loss function is based, at least in part, on a measure of the smoothness of the sub-network.
In one embodiment, the above loss function is minimized jointly for multiple sub-networks sampled for the same batch of training data, as in the sketch below. This biases each sub-network toward greater smoothness, which in turn helps to achieve better training convergence and higher accuracy during weight-sharing NAS. The inclusion of the second term recognizes that smoothness with respect to sub-network weights can contribute to the stability of neural architecture search.
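A minimal sketch of such a joint training step follows, reusing the sigma_max_power_iteration helper from the earlier sketch. The sub-network sampler, the number of samples per batch, and the value of the smoothness scale factor λ (lam) are assumptions made for the example.

import torch
import torch.nn.functional as F

def smooth_nas_step(supernet, optimizer, inputs, targets,
                    sample_subnet, num_samples=4, lam=1e-3):
    # One training step: accumulate Loss_CE plus the Lipschitz penalty
    # over several sub-networks sampled for the same batch, then update
    # the shared weights and architecture parameters jointly.
    # `sample_subnet` is an assumed helper returning a sub-network that
    # shares the super-net's weights; `lam` is an illustrative value.
    optimizer.zero_grad()
    for _ in range(num_samples):
        subnet = sample_subnet(supernet)
        loss_ce = F.cross_entropy(subnet(inputs), targets)
        # Second loss: sum of estimated maximum singular values over layers,
        # using sigma_max_power_iteration from the earlier sketch.
        lipschitz = sum(sigma_max_power_iteration(p)
                        for p in subnet.parameters() if p.dim() >= 2)
        loss = loss_ce + lam * lipschitz
        loss.backward()              # gradients accumulate across samples
    optimizer.step()                 # joint update of W and alpha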
In the example results described below, super-nets are constructed for AttentiveNAS-like search spaces, and two examples are considered.
AttentiveNAS-like (Super-net A), in which the super-net includes 17 Inverted Bottleneck (IBN) blocks, first introduced in the MobileNet-V2 model. For each convolution layer, a search is performed over two to four options for the number of output channels and two options for the kernel size (3×3 or 5×5) of the depth-wise convolution layers.
AttentiveNAS-like (Super-net B), in which the super-net includes up to 15 IBN blocks. For each convolution layer, a search is performed over seven options for the number of output channels, and the layer/block type is selected to be an IBN block, an average-pooling layer, or a fused convolution layer for blocks {1, 6, 11} of the super-net.
Search of the architecture space is performed using α-parameters to learn the optimal architecture, together with the weight-sharing mechanism introduced by Single Path NAS.
The standard training datasets CIFAR-10/CIFAR-100 are used to adjust the weights and architecture parameters and to evaluate the architecture found by the search. Each dataset is divided into a number of data batches. In each training epoch, the super-net is trained using all of the data batches in the dataset.
A Stochastic Gradient Descent (SGD) optimizer with momentum 0.9 is used to train the super-net for 60 epochs. In each training step, a batch of training data is accessed and used to adjust the weights and architecture parameters. A cosine-annealing schedule is used to tune the learning rate.
For super-net A, four sub-networks are sampled on the same training batch with batch size 500 and the aggregated gradient is used to update the parameters. For super-net B, the batch size is 250 and a single sample sub-network is used for every training batch. The aggregated gradient for every four training batches is used to update the weights and architecture parameters.
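For illustration, the schedule described above for super-net B might be set up as follows. The stand-in super-net, the synthetic data loader, the cross-entropy stand-in for the full loss, and the initial learning rate of 0.1 are all assumptions made so that the sketch is self-contained and runnable.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins so the sketch runs; in practice these are the trained
# super-net and the CIFAR data loader described above.
supernet = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))
loader = [(torch.randn(250, 3, 32, 32), torch.randint(0, 10, (250,)))] * 8

optimizer = torch.optim.SGD(supernet.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=60)

for epoch in range(60):
    for step, (inputs, targets) in enumerate(loader):    # batch size 250
        # In the full method, the smoothness (Lipschitz) term is added here.
        loss = F.cross_entropy(supernet(inputs), targets)
        loss.backward()                                  # gradients accumulate
        if (step + 1) % 4 == 0:   # aggregate the gradient over four batches
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()              # advance the cosine-annealed learning rate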
Hardware-aware NAS: In a further example, networks were trained using the CIFAR-10 dataset. In addition, the number of parameters was introduced as another loss term for both SmoothNAS and baseline methods. The objective was to create a hardware-aware NAS with a reduced number of parameters while simultaneously training the super-net.
Table 1 shows results of hardware-aware NAS for super-net B on the CIFAR-10 dataset. The results indicate that training with the disclosed Lipschitz loss function provided better test performance (0.95% higher accuracy) and required less computation (1.9M fewer floating-point operations).
The results illustrate that SmoothNAS is also applicable and effective under a hardware-aware NAS setup. In particular, SmoothNAS results in a similar hardware cost but achieves higher test accuracy.
In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Reference throughout this document to “one embodiment,” “certain embodiments,” “an embodiment,” “implementation(s),” “aspect(s),” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.
The term “or,” as used herein, is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
As used herein, the term “configured to,” when applied to an element, means that the element may be designed or constructed to perform a designated function, or that it has the required structure to enable it to be reconfigured or adapted to perform that function.
Numerous details have been set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The disclosure is not to be considered as limited to the scope of the embodiments described herein.
Those skilled in the art will recognize that the present disclosure has been described by means of examples. The present disclosure could be implemented using hardware component equivalents, such as special-purpose hardware and/or dedicated processors, which are equivalent to the present disclosure as described and claimed. Similarly, dedicated processors and/or dedicated hard-wired logic may be used to construct alternative equivalent embodiments of the present disclosure.
Various embodiments described herein are implemented using dedicated hardware, configurable hardware or programmed processors executing programming instructions that are broadly described in flow chart form that can be stored on any suitable electronic storage medium or transmitted over any suitable electronic communication medium. A combination of these elements may be used. Those skilled in the art will appreciate that the processes and mechanisms described above can be implemented in any number of variations without departing from the present disclosure. For example, the order of certain operations carried out can often be varied, additional operations can be added, or operations can be deleted without departing from the present disclosure. Such variations are contemplated and considered equivalent.
The various representative embodiments, which have been described in detail herein, have been presented by way of example and not by way of limitation. It will be understood by those skilled in the art that various changes may be made in the form and details of the described embodiments resulting in equivalent embodiments that remain within the scope of the appended claims.