Knowledge distillation is a general model compression/optimization method that transfers the knowledge of a large teacher network (or set thereof) to a smaller student network, or from a network whose architecture is suited to run on one type of hardware to a network whose architecture is suited to run on a different type of hardware. However, in traditional end-to-end distillation, candidate student networks (which are usually variants of the teacher network with fewer layers and/or parameters, or with architectures suited to a different type of hardware) must each be individually trained to mimic the output of the teacher network, and then compared to one another in order to choose which student network offers the best trade-off between complexity and accuracy. Because some layers or groups of layers in a deep neural network will be harder to distill than others, finding the ideal architecture for the student network can require consideration of a large number of candidate student networks, and can thus be both computationally expensive and time-consuming.
The present technology relates to systems and methods for distilling deep neural networks. In that regard, the technology relates to distilling a teacher network into a student network by selecting blocks or “neighborhoods” of the teacher network, training individual student models to reproduce the output of each teacher neighborhood, selecting the best student model corresponding to each teacher neighborhood, and then assembling the student models to create a full student network.
In one aspect, the disclosure describes a method of using a first neural network to generate a second neural network, comprising: (i) dividing the first neural network into a plurality of neighborhoods; (ii) for each given neighborhood of the plurality of neighborhoods: generating, by one or more processors of a processing system, a plurality of candidate student models; receiving, by the one or more processors, a first output from the given neighborhood, the first output having been produced by the given neighborhood based on an input; receiving, by the one or more processors, a plurality of second outputs, each second output having been produced by a given candidate student model of the plurality of candidate student models based on the input; comparing, by the one or more processors, the first output to each second output of the plurality of second outputs to generate a first training gradient corresponding to each candidate student model of the plurality of candidate student models; modifying, by the one or more processors, one or more parameters of each given candidate student model of the plurality of candidate student models based at least in part on the first training gradient corresponding to the given candidate student model; and identifying, by the one or more processors, a selected model, the selected model being a copy of one of the plurality of candidate student models or a copy of the given neighborhood; and (iii) combining, by the one or more processors, the selected model corresponding to each given neighborhood of the plurality of neighborhoods to form the second neural network. In some aspects, identifying the selected model is based at least in part on a comparison of a size of each candidate student model of the plurality of candidate student models. In some aspects, identifying the selected model is based at least in part on a comparison of a number of layers of each candidate student model of the plurality of candidate student models. In some aspects, identifying the selected model is based at least in part on a comparison of a measurement of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood. Further, in some aspects, the measurement of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood is based at least in part on a mean square error between an output of the given neighborhood based on a given input and an output of each candidate student model of the plurality of candidate student models based on the given input. In some aspects, the input comprises an output received from a neighborhood preceding the given neighborhood.
In some aspects, the method set forth in the preceding paragraph further comprises, for each given neighborhood of the plurality of neighborhoods: providing, by the one or more processors, the first output to a head model, the head model comprising a copy of a portion of the first neural network which directly follows the given neighborhood; providing, by the one or more processors, each second output of the plurality of second outputs to the head model; receiving, by the one or more processors, a third output from the head model, the third output having been produced by the head model based on the first output; receiving, by the one or more processors, a plurality of fourth outputs, each fourth output having been produced by the head model based on a given second output of the plurality of second outputs; comparing, by the one or more processors, the third output to each fourth output of the plurality of fourth outputs to generate a second training gradient corresponding to each candidate student model of the plurality of candidate student models; and modifying, by the one or more processors, the one or more parameters of each given candidate student model of the plurality of candidate student models based at least in part on the second training gradient corresponding to the given candidate student model. Further, in some aspects, identifying the selected model is based at least in part on a comparison of a measurement of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood. Further still, in some aspects, the measurement of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood is based at least in part on a mean square error between an output of the given neighborhood based on a given input and an output of each candidate student model of the plurality of candidate student models based on the given input. In some aspects, the input comprises an output received from a neighborhood preceding the given neighborhood.
In another aspect, the disclosure describes a processing system comprising: a memory; and one or more processors coupled to the memory and configured as follows. In that regard, for each given neighborhood of a plurality of neighborhoods, where each given neighborhood comprises a piece of a first neural network, the one or more processors are configured to: generate a plurality of candidate student models; receive a first output from the given neighborhood, the first output having been produced by the given neighborhood based on an input; receive a plurality of second outputs, each second output having been produced by a given candidate student model of the plurality of candidate student models based on the input; compare the first output to each second output of the plurality of second outputs to generate a first training gradient corresponding to each candidate student model of the plurality of candidate student models; modify one or more parameters of each given candidate student model of the plurality of candidate student models based at least in part on the first training gradient corresponding to the given candidate student model; and identify a selected model, the selected model being a copy of one of the plurality of candidate student models or a copy of the given neighborhood. In addition, the one or more processors are configured to combine the selected model corresponding to each given neighborhood of the plurality of neighborhoods to form a second neural network. In some aspects, the one or more processors are further configured to identify the selected model based at least in part on a comparison of a size of each candidate student model of the plurality of candidate student models. In some aspects, the one or more processors are further configured to identify the selected model based at least in part on a comparison of a number of layers of each candidate student model of the plurality of candidate student models. In some aspects, the one or more processors are further configured to identify the selected model based at least in part on a comparison of a measurement of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood. Further, in some aspects, the measurement of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood is based at least in part on a mean square error between an output of the given neighborhood based on a given input and an output of each candidate student model of the plurality of candidate student models based on the given input. In some aspects, the input comprises an output received from a neighborhood preceding the given neighborhood.
In some aspects, the one or more processors described in the preceding paragraph are further configured to, for each given neighborhood of the plurality of neighborhoods: provide the first output to a head model, the head model comprising a copy of a portion of the first neural network which directly follows the given neighborhood; provide each second output of the plurality of second outputs to the head model; receive a third output from the head model, the third output having been produced by the head model based on the first output; receive a plurality of fourth outputs, each fourth output having been produced by the head model based on a given second output from the plurality of second outputs; compare the third output to each fourth output of the plurality of fourth outputs to generate a second training gradient corresponding to each candidate student model of the plurality of candidate student models; and modify the one or more parameters of each given candidate student model of the plurality of candidate student models based at least in part on the second training gradient corresponding to the given candidate student model. Further, in some aspects, the one or more processors are further configured to identify the selected model based at least in part on a comparison of a measurement of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood. Further still, in some aspects, the measurement of how closely each candidate student model of the plurality of candidate student models approximates the output of the given neighborhood is based at least in part on a mean square error between an output of the given neighborhood based on a given input and an output of each candidate student model of the plurality of candidate student models based on the given input. In some aspects, the input comprises an output received from a neighborhood preceding the given neighborhood.
The present technology will now be described with respect to the following exemplary systems and methods.
A high-level system diagram 100 in accordance with aspects of the technology is shown in FIG. 1. Processing system 102 includes one or more processors 104 and a memory 106 storing instructions 108 and data 110.
Processing system 102 may be implemented on any type of computing device(s), such as any type of general computing device, server, or set thereof, and may further include other components typically present in general purpose computing devices or servers. Memory 106 stores information accessible by the one or more processors 104, including instructions 108 and data 110 that may be executed or otherwise used by the processor(s) 104. Memory 106 may be of any non-transitory type capable of storing information accessible by the processor(s) 104. For instance, memory 106 may include a non-transitory medium such as a hard drive, memory card, optical disk, solid-state memory, tape memory, or the like. Computing devices suitable for the roles described herein may include different combinations of the foregoing, whereby different portions of the instructions and data are stored on different types of media.
In all cases, the computing devices described herein may further include any other components normally used in connection with a computing device such as a user interface subsystem. The user interface subsystem may include one or more user inputs (e.g., a mouse, keyboard, touch screen and/or microphone) and one or more electronic displays (e.g., a monitor having a screen or any other electrical device that is operable to display information). Output devices besides an electronic display, such as speakers, lights, and vibrating, pulsing, or haptic elements, may also be included in the computing devices described herein.
The one or more processors included in each computing device may be any conventional processors, such as commercially available central processing units (“CPUs”), graphics processing units (“GPUs”), tensor processing units (“TPUs”), etc. Alternatively, the one or more processors may be a dedicated device such as an ASIC or other hardware-based processor. Each processor may have multiple cores that are able to operate in parallel. The processor(s), memory, and other elements of a single computing device may be stored within a single physical housing, or may be distributed between two or more housings. Similarly, the memory of a computing device may include a hard drive or other storage media located in a housing different from that of the processor(s), such as in an external database or networked storage device. Accordingly, references to a processor or computing device will be understood to include references to a collection of processors or computing devices or memories that may or may not operate in parallel, as well as one or more servers of a load-balanced server farm or cloud-based system.
The computing devices described herein may store instructions capable of being executed directly (such as machine code) or indirectly (such as scripts) by the processor(s). The computing devices may also store data, which may be retrieved, stored, or modified by one or more processors in accordance with the instructions. Instructions may be stored as computing device code on a computing device-readable medium. In that regard, the terms “instructions” and “programs” may be used interchangeably herein. Instructions may also be stored in object code format for direct processing by the processor(s), or in any other computing device language including scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. By way of example, the programming language may be C#, C++, JAVA or another computer programming language. Similarly, any components of the instructions or programs may be implemented in a computer scripting language, such as JavaScript, PHP, ASP, or any other computer scripting language. Furthermore, any one of these components may be implemented using a combination of computer programming languages and computer scripting languages.
The computing devices may comprise a speech recognition engine configured to convert speech, input by a user into a microphone associated with the computing device, into text data. Such an input may be a user query directed towards, for example, an automated assistant accessible through the computing device. The text data generated from the user voice input may be processed using any of the methods described herein to tokenize the text data for further processing. The tokenized text data may, for example, be processed to extract a query for the automated assistant that is present in the user voice input. The query may be sent to the automated assistant, which may in turn provide one or more services to the user in response to the query via the computing device.
The present technology provides methods and systems for breaking a teacher network down into smaller sub-networks, referred to herein as neighborhoods, which may each be distilled independently and then reassembled into a student network. Because each neighborhood is simpler than the teacher network, each candidate student model for a given neighborhood can be trained in parallel, significantly reducing the time necessary for training each candidate student model. This may also reduce the number of comparisons that need to be made in order to identify the optimal architecture for the student network. In that regard, using neighborhood distillation, if there are K different configurations tried for n different neighborhoods, the processing system need only train K × n candidate student models in order to identify the optimal combination of student models to include in the final student network. In contrast, using end-to-end distillation, the processing system would need to train every permutation, consisting of Kⁿ full candidate student networks, in order to find the same optimal final student network. Finally, if the neighborhoods are kept sufficiently small, it becomes possible to train each candidate student model using general CPUs rather than requiring the custom accelerators (e.g., TPUs) traditionally used for deep learning, and to feed the networks random Gaussian noise as an input rather than requiring data similar to what the teacher network was originally trained on.
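By way of a purely illustrative comparison (the values of K and n below are hypothetical and not taken from the disclosure), the difference in training effort can be sketched in a few lines of Python:

```python
# Illustrative comparison of training effort under the two approaches.
K = 4   # candidate student configurations tried per neighborhood (hypothetical value)
n = 5   # number of neighborhoods the teacher network is divided into (hypothetical value)

neighborhood_distillation = K * n   # one small candidate per (configuration, neighborhood) pair
end_to_end_distillation = K ** n    # one full network per combination of configurations

print(neighborhood_distillation)    # 20 small candidate student models
print(end_to_end_distillation)      # 1024 full candidate student networks
```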
The present technology may be used to generate a final student network that is simpler than the teacher network. Alternatively or additionally, the present technology may be used to generate a student network that is optimized to run on a particular hardware platform (e.g. a CPU, a mobile device, etc.) that is different from the hardware platform on which the teacher network was optimized to run (e.g. a TPU, a GPU, an enterprise server, etc.).
In the example of FIG. 2, a teacher network 202 is divided into a plurality of neighborhoods, including neighborhoods 204, 206, and 208. For each neighborhood, a corresponding trainer 212, 214, 216, or 218 trains a set of candidate student models to reproduce the output of that neighborhood, and then selects either one of those candidate student models or a copy of the neighborhood itself for inclusion in the full student model 220.
Trainers 212, 214, 216, and 218 may be any set of processes running serially or in parallel on processing system 102, or on a set of processing systems. As shown in FIG. 2, each trainer operates on a different neighborhood of the teacher network 202, and may therefore perform its training and selection independently of the other trainers.
While the example of FIG. 2 shows the teacher network 202 divided into a particular number of neighborhoods, each handled by a separate trainer, the teacher network may be divided into any suitable number of neighborhoods of any suitable size, and any suitable number of trainers may be used.
The selections shown in FIG. 2 are merely exemplary; in practice, which candidate student model (if any) is selected for a given neighborhood will depend on the criteria applied by the corresponding trainer, as discussed further below.
Trainers 212, 214, 216, and 218 may be configured to assess candidate student models and make their selections based on any suitable criteria. For example, the trainers may be configured to select the simplest or smallest candidate student model (e.g., the candidate student model with the fewest layers and/or parameters) that meets a certain threshold accuracy when compared to its respective teacher network neighborhood (e.g., accuracy within x% of the teacher network neighborhood, calculated using an average mean square error over some predetermined number of outputs). Further in that regard, in some aspects of the technology, the accuracy of the candidate student model may be assessed by assembling it with one or more preceding or trailing teacher network neighborhoods to form an intermediate model. Thus, for example, the accuracy of student model 209b may be assessed by creating a first intermediate model comprised of neighborhood 204, neighborhood 206, and student model 209b, and comparing its performance to a second intermediate model comprised of neighborhood 204, neighborhood 206, and neighborhood 208. In such a case, inputs may be passed into neighborhood 204 of the first and second intermediate models, and the resulting outputs from student model 209b and neighborhood 208 may then be compared.
Likewise, the trainers may be configured to select the most accurate candidate student model that falls below a threshold size or simplicity (e.g., a size less than or equal to y% of that of the teacher network neighborhood, or fewer than z layers or parameters). In addition, while the selected candidate student models may in some cases be simpler or smaller than their respective teacher network neighborhoods, they need not be in all cases. For example, in instances where the teacher network is optimized for inference to take place on one type of platform (e.g., a TPU or enterprise server) and is being distilled in order to create a student network that can be used by a different hardware platform (e.g., a PC or mobile device), the selected student models (or the student network as a whole) may ultimately be as complex as or more complex than their respective teacher network neighborhoods (or the teacher network as a whole) but still be desirable due to their compatibility.
In addition, as shown with respect to trainer 216, a trainer may also be configured to choose not to replace a given teacher network neighborhood with one of the candidate student models, and to instead simply include a copy of the given teacher network neighborhood in the student network. The trainer may be configured to do this, for example, where none of the candidate student models are found to mimic the output of the teacher network neighborhood with acceptable accuracy, or where none of the acceptably accurate candidate student models are found to be acceptably small and/or simple relative to the given teacher network neighborhood.
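One possible form of such a selection rule is sketched below. This is a minimal illustration in Python using PyTorch, not an implementation prescribed by the disclosure: the function name, the mean-square-error threshold, and the use of parameter count as the simplicity measure are assumptions, and a layer count or any other size metric could be substituted.

```python
import copy
import torch

def select_student(neighborhood, candidates, val_inputs, max_mse=1e-3):
    """Pick the smallest acceptable candidate, or fall back to a copy of the teacher neighborhood.

    `neighborhood` and each entry of `candidates` are torch.nn.Module objects;
    `val_inputs` is a list of held-out input tensors; `max_mse` is a hypothetical
    accuracy threshold.
    """
    def mean_mse(student):
        # Average mean square error against the teacher neighborhood over the held-out inputs.
        with torch.no_grad():
            errors = [torch.mean((neighborhood(x) - student(x)) ** 2).item() for x in val_inputs]
        return sum(errors) / len(errors)

    def size(model):
        # Simplicity measured here as total parameter count; layer count would also work.
        return sum(p.numel() for p in model.parameters())

    acceptable = [c for c in candidates if mean_mse(c) <= max_mse]
    if not acceptable:
        # No candidate mimics the neighborhood closely enough: keep the teacher neighborhood.
        return copy.deepcopy(neighborhood)
    return min(acceptable, key=size)
```

Measuring accuracy against the neighborhood in isolation, as above, is the simplest option; as noted, the candidate could instead be assembled with preceding or trailing neighborhoods into an intermediate model before the error is computed.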
In some aspects of the technology, the trainers may be configured to train each candidate student model with the goal of minimizing a loss function measuring a difference between the output of the teacher network neighborhood and the output of the candidate student model. For example, the trainers may be configured to train each candidate student model by minimizing the mean square error between its output and the output of the teacher network neighborhood. In that regard, FIG. 3 shows an exemplary arrangement for training a single candidate student model to reproduce the output of a single teacher network neighborhood.
In the example of FIG. 3, a given input 302 (which may be, for example, data of the type on which the teacher network was originally trained, or random Gaussian noise) is provided to a teacher root 304. The teacher root 304 comprises a copy of the portion of the teacher network that precedes the teacher network neighborhood 306 being distilled, such that the signal it produces is the same signal that the teacher network neighborhood 306 would receive within the full teacher network. The output of the teacher root 304 is then passed both to the teacher network neighborhood 306 and to the candidate student model 308.
Based on the signal passed to them by the teacher root 304, the teacher network neighborhood 306 and the candidate student model 308 will both generate outputs. In that regard, in the example of FIG. 3, the output of the teacher network neighborhood 306 and the output of the candidate student model 308 are compared to compute mean square error 310. The trainer may use mean square error 310 to generate training gradients and recursively tune the parameters of the candidate student model 308 until its output approximates (or matches) the output of the teacher network neighborhood 306 for each input 302.
In the example of FIG. 3, the output of the teacher network neighborhood 306 may also be provided to a teacher head 312a, and the output of the candidate student model 308 may be provided to an identical copy of that head, teacher head 312b. Each teacher head comprises a copy of a portion of the teacher network that directly follows the teacher network neighborhood 306. The outputs of teacher head 312a and teacher head 312b may then be compared to compute mean square error 314, referred to herein as a look-ahead loss, which measures how a difference between the candidate student model 308 and the teacher network neighborhood 306 propagates through the portion of the teacher network that follows the neighborhood.
As explained above, mean square error 314 may also be used by the trainer, alone or in combination with mean square error 310, to recursively generate training gradients and tune the parameters of the candidate student model until the look-ahead loss has been minimized. Here as well, in some aspects of the technology, the output from the teacher root 304 for each input may be pre-computed and stored before training so that those outputs may be fed directly into the candidate student model 308 and teacher network neighborhood 306. Likewise, in some aspects of the technology, the outputs from the teacher head 312a based on the signals received from teacher network neighborhood 306 (which in turn are based on each input 302) may be pre-computed and stored before training so that only the output of teacher head 312b (i.e., the teacher head’s output based on what is received from the candidate student model 308) must be computed prior to calculating each mean square error 314.
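A minimal sketch of that pre-computation is shown below (Python/PyTorch; the function name and the assumption that the cached outputs fit in memory are illustrative). Because the teacher root 304, the teacher network neighborhood 306, and teacher head 312a are fixed during training, their outputs for each input 302 can be computed once and reused, leaving only the candidate student model 308 and teacher head 312b to be evaluated on each training step.

```python
import torch

@torch.no_grad()
def precompute_teacher_signals(teacher_root, teacher_neighborhood, teacher_head, inputs):
    """Cache the fixed teacher-side outputs once, before training begins."""
    cache = []
    for x in inputs:
        root_out = teacher_root(x)                          # signal fed to neighborhood 306 and candidate 308
        neighborhood_out = teacher_neighborhood(root_out)   # target for mean square error 310
        head_out = teacher_head(neighborhood_out)           # target for look-ahead error 314 (via head 312a)
        cache.append((root_out, neighborhood_out, head_out))
    return cache
```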
In some aspects of the technology, after a full student model 220 has been generated as described above with respect to FIGS. 2 and 3, the full student model 220 may be further fine-tuned, for example by training it end-to-end to mimic the output of the teacher network 202 (or to fit some or all of the original training data), in order to recover any accuracy lost during neighborhood-by-neighborhood distillation.
In addition, in some aspects of the technology, after a full student model 220 has been generated as described above with respect to FIGS. 2 and 3, the full student model 220 may be used in place of the teacher network 202 to perform inference, for example on a hardware platform (e.g., a CPU or mobile device) different from the platform for which the teacher network 202 was optimized.
FIG. 4 shows an exemplary method 400 of using a first neural network (e.g., teacher network 202) to generate a second neural network, in accordance with aspects of the technology. In step 402, the processing system 102 divides the first neural network into a plurality of neighborhoods. In step 404, the processing system 102 generates a plurality of candidate student models for each given neighborhood. For example, in the context of FIG. 2, each of trainers 212, 214, 216, and 218 may generate a set of candidate student models for its respective neighborhood of the teacher network 202.
In step 406, the processing system 102 provides an input to each given neighborhood, and receives a first output from the given neighborhood based on that input. For example, in the context of FIG. 3, the first output may be the output produced by the teacher network neighborhood 306 based on the signal it receives from the teacher root 304.
Likewise, in step 408, the processing system 102 provides the same input to each candidate student model of the plurality of candidate student models, and receives a plurality of second outputs based on that input. Thus, continuing with the example of FIG. 3, each candidate student model (e.g., candidate student model 308) receives the same signal from the teacher root 304 and produces its own second output based on that signal.
Then, in step 410, for each given neighborhood, the processing system 102 compares the first output (received in step 406) to each second output of the plurality of second outputs (received in step 408) to generate a first training gradient corresponding to each candidate student model of the plurality of candidate student models. These “first training gradients” may be generated, for example, as described above with respect to FIG. 3 (e.g., based on mean square error 310).
Next, in step 412, for each given neighborhood, the processing system 102 modifies one or more parameters of each given candidate student model of the plurality of candidate student models based at least in part on the first training gradient corresponding to the given candidate student model. Moreover, as indicated by the arrow extending from step 412 back to step 406, once these parameters have been modified, steps 406-412 may be repeated so as to recursively feed new inputs to the given neighborhood and each candidate student model and tune the parameters of each candidate student model (based on newly generated training gradients) until the output of each candidate student model begins to approximate (or match) the output of the given neighborhood for each input. Here as well, the recursive subprocess of steps 406-412 may utilize some or all of the original dataset on which the first neural network was trained (e.g., a set of images, or a set of text documents), a set of data similar to the dataset on which the teacher network 202 was trained (e.g., a different set of images, or a different set of text documents), Gaussian noise, or some combination thereof. Thus, continuing with the example of FIG. 3, the trainer may repeatedly feed inputs 302 through the teacher root 304 and tune the parameters of candidate student model 308 until its output approximates the output of teacher network neighborhood 306.
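The recursive subprocess of steps 406-412 might be sketched as follows (Python/PyTorch; the choice of the Adam optimizer, the learning rate, the number of iterations, and the Gaussian-noise input are illustrative assumptions rather than requirements of the method). Each candidate student model has its own optimizer, so the candidates can be trained independently and, if desired, in parallel.

```python
import torch

def train_candidates(neighborhood, candidates, input_shape, steps=1000, lr=1e-3):
    """Recursive subprocess of steps 406-412 for one neighborhood and its candidate student models."""
    loss_fn = torch.nn.MSELoss()
    optimizers = [torch.optim.Adam(c.parameters(), lr=lr) for c in candidates]

    for _ in range(steps):
        x = torch.randn(*input_shape)            # Gaussian-noise input fed to all models
        with torch.no_grad():
            first_output = neighborhood(x)       # step 406: output of the given neighborhood

        for candidate, optimizer in zip(candidates, optimizers):
            second_output = candidate(x)                     # step 408: candidate output
            loss = loss_fn(second_output, first_output)      # step 410: comparison
            optimizer.zero_grad()
            loss.backward()                                  # first training gradient
            optimizer.step()                                 # step 412: modify candidate parameters
    return candidates
```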
Following the recursive subprocess of steps 406-412, in step 414, the processing system 102 identifies, for each given neighborhood, a selected model. The selected model may be a copy of one of the plurality of candidate student models, or it may be a copy of the given neighborhood (e.g., in cases where none of the candidate student models is deemed acceptable). As discussed above with respect to FIG. 2, this identification may be based on any suitable criteria, such as the size or number of layers of each candidate student model, and/or a measurement of how closely each candidate student model approximates the output of the given neighborhood.
Finally, in step 416, the processing system 102 combines the selected model corresponding to each given neighborhood of the plurality of neighborhoods to form the second neural network. Thus, continuing with the example of FIG. 2, the selected models identified by trainers 212, 214, 216, and 218 are assembled, in the order of their corresponding neighborhoods, to form the full student model 220.
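Where the neighborhoods are strictly sequential pieces of the first neural network, step 416 can be sketched as simply chaining the selected models in order (Python/PyTorch; the assumption of a purely sequential topology is an illustration and would not hold for arbitrary branching architectures):

```python
import torch

def assemble_student_network(selected_models):
    """Step 416: combine the per-neighborhood selections into the second neural network.

    `selected_models` is a list of torch.nn.Module objects, one per neighborhood,
    ordered to match the order of the corresponding neighborhoods in the first neural network.
    """
    return torch.nn.Sequential(*selected_models)
```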
FIG. 5 shows an exemplary method 500 of further training the candidate student models using a head model, in accordance with aspects of the technology; method 500 may be performed in addition to method 400. In that regard, in step 504, the processing system 102 provides the first output (received from each given neighborhood in step 406 of FIG. 4) to a head model, and receives a third output that the head model produces based on the first output. The head model comprises a copy of a portion of the first neural network which directly follows the given neighborhood (e.g., teacher head 312a of FIG. 3).
Likewise, in step 506, the processing system 102 provides each second output of the plurality of second outputs (received from each candidate student model in step 408 of FIG. 4) to the head model, and receives a plurality of fourth outputs, each fourth output having been produced by the head model based on a given second output.
In step 508, for each given neighborhood, the processing system 102 compares the third output (received in step 504) to each fourth output of the plurality of fourth outputs (received in step 506) to generate a second training gradient corresponding to each candidate student model of the plurality of candidate student models. These second training gradients may be generated, for example, as described above with respect to FIG. 3 (e.g., based on mean square error 314).
Finally, in step 510, for each given neighborhood, the processing system 102 modifies one or more parameters of each given candidate student model of the plurality of candidate student models based at least in part on the second training gradient (generated in step 508) corresponding to the given candidate student model. In that regard, as method 500 is performed in addition to method 400 of FIG. 4, the parameters of each given candidate student model may be modified based on both the first training gradient and the second training gradient, for example by combining the corresponding loss terms before generating the gradients.
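A single combined update using both training gradients might be sketched as follows (Python/PyTorch; the equal weighting of the two mean-square-error terms is an illustrative assumption). The optimizer holds only the candidate student model's parameters, so gradients flow backward through the head model into the candidate without the head model itself being modified.

```python
import torch

def train_step_with_head(neighborhood, head, candidate, optimizer, x):
    """One combined update using both the first (steps 406-412) and second (steps 504-510) gradients."""
    loss_fn = torch.nn.MSELoss()
    with torch.no_grad():
        first_output = neighborhood(x)      # teacher neighborhood output (step 406)
        third_output = head(first_output)   # head model output for the teacher signal (step 504)

    second_output = candidate(x)            # candidate student model output (step 408)
    fourth_output = head(second_output)     # head model output for the student signal (step 506)

    # The 1:1 weighting of the two error terms is an illustrative assumption.
    loss = loss_fn(second_output, first_output) + loss_fn(fourth_output, third_output)
    optimizer.zero_grad()
    loss.backward()                         # first and second training gradients (steps 410 and 508)
    optimizer.step()                        # modify the candidate's parameters (steps 412 and 510)
    return loss.item()
```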
Further, in some aspects of the technology, steps 410 and 412 may be omitted from the combined methods of FIGS. 4 and 5, such that the parameters of each candidate student model are modified based only on the second training gradient generated from the outputs of the head model.
Unless otherwise stated, the foregoing alternative examples are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of exemplary systems and methods should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as “such as,” “including,” “comprising,” and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only some of the many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.