This disclosure is related to the field of machine learning, and in particular, to Deep Neural Networks (DNNs).
A DNN is a type of machine learning model that uses a set of connections modeled after the connections between neurons in the human brain to learn from examples. A DNN is represented as a hierarchical (layered) organization of nodes, similar to neurons, with connections to other nodes. The inputs to a DNN are primitives of the objects to be classified (e.g., the pixels of an image), and the outputs are the set of classes to be identified. Each node in a hidden layer receives input from nodes in a previous layer, combines the signals from the nodes in the previous layer via a non-linear function, and passes its output to nodes in the next layer. The connection between two nodes of successive layers has an associated weight that defines the influence of that input on the non-linear function, and hence on the output generated for the next node. During training of a DNN, training data (i.e., which inputs give rise to which outputs) is fed to the DNN, and the weights of the connections are updated or optimized until the DNN has a desired accuracy. For example, if the DNN is being trained for object recognition, a large number of labeled images may be fed to the DNN until the DNN is able to produce sufficiently accurate classifications of the images on the training dataset.
DNNs may be used to carry out complex tasks such as machine vision, object recognition, disease diagnosis, bioinformatics, drug design, natural language processing, machine translation, etc. Modern computing power allows DNNs to grow deeper and larger, having millions of adjustable parameters (i.e., weights). However, a DNN having a large parameter count may require a system to have large storage and computational resources.
Described herein is a system and associated method of extracting a subnetwork from a DNN. As described above, a fully-connected DNN includes a plurality of nodes arranged in layers, with full connections between the nodes in consecutive layers. A system as described herein selects a minimal set of connections from the DNN (i.e., unmasked connections), and masks the remaining connections. The minimal set of unmasked connections is expected to have a training accuracy below an accuracy threshold. The system builds up from the minimal set of unmasked connections to discover a subnetwork that has the desired training accuracy but has a much smaller parameter count than the DNN. The system may also prune the subnetwork to further reduce the parameter count. One technical benefit is that the resultant subnetwork has far lower storage and processing requirements than the full DNN, and may thus have applications in resource-limited devices, such as a mobile phone or a robot.
One embodiment comprises a subnetwork discovery system for a DNN comprising a plurality of nodes with connections between the nodes. The subnetwork discovery system comprises at least one processor and memory. The processor causes the subnetwork discovery system to initialize the DNN by assigning initial values to weights associated with the connections, and identify an initial subset of the connections randomly selected from the DNN. The initial subset after training with a training dataset has a training accuracy below an accuracy threshold. The processor causes the subnetwork discovery system to perform a growth process by adding additional connections from the DNN to the initial subset to generate a qualified subset of the connections. The qualified subset of the connections after training with the training dataset has a training accuracy that reaches the accuracy threshold.
In another embodiment, the processor causes the subnetwork discovery system to perform a pruning process to remove a portion of the connections from the qualified subset to generate a pruned subset of the connections. The pruned subset after training with the training dataset has a training accuracy that reaches the accuracy threshold.
In another embodiment, for the pruning process, the processor causes the subnetwork discovery system to reset the weights of the connections in the qualified subset to the initial values, optimize the weights of the connections in the qualified subset based on the training dataset, and remove a portion of the connections from the qualified subset based on a prune percentage.
In another embodiment, the processor causes the subnetwork discovery system to perform multiple iterations of the pruning process.
In another embodiment, the initial subset contains an initial percentage of the connections in the DNN, and the qualified subset contains a target percentage of the connections of the DNN. For the growth process, the processor causes the subnetwork discovery system to use a binary search to identify the target percentage for the qualified subset.
In another embodiment, for the binary search, the processor causes the subnetwork discovery system to identify the initial percentage as a lower bound of a search interval; select, as an upper bound of the search interval, an upper bound percentage of the connections that reaches the accuracy threshold; identify an intermediate percentage of the search interval; add additional connections to the initial subset based on the intermediate percentage to form a candidate subset of the connections; reset the weights of the connections in the candidate subset to the initial values; optimize the weights of the connections in the candidate subset based on the training dataset; determine the training accuracy of the candidate subset; narrow the search interval to an upper half of the search interval when the training accuracy is below the accuracy threshold; and narrow the search interval to a lower half of the search interval when the training accuracy reaches the accuracy threshold.
In another embodiment, for the growth process, the processor causes the subnetwork discovery system to select a growth percentage, add additional connections to the initial subset based on the growth percentage to form a candidate subset of the connections, reset the weights of the connections in the candidate subset to the initial values, optimize the weights of the connections in the candidate subset based on the training dataset, determine the training accuracy of the candidate subset, identify the candidate subset as the qualified subset when the training accuracy reaches the accuracy threshold, and initiate another iteration of the growth process when the training accuracy is below the accuracy threshold.
In another embodiment, the initial subset contains an initial percentage of the connections in the DNN. The processor causes the subnetwork discovery system to select the initial percentage based on a ratio of a size of the training dataset to a total number of the weights in the DNN.
Another embodiment comprises a method of processing a DNN comprising a plurality of nodes with connections between the nodes. The method comprises initializing the DNN by assigning initial values to weights associated with the connections, and identifying an initial subset of the connections randomly selected from the DNN. The initial subset after training with a training dataset has a training accuracy below an accuracy threshold. The method further comprises performing a growth process by adding additional connections from the DNN to the initial subset to generate a qualified subset of the connections. The qualified subset of the connections after training with the training dataset has a training accuracy that reaches the accuracy threshold.
In another embodiment, the method further comprises performing a pruning process to remove a portion of the connections from the qualified subset to generate a pruned subset of the connections. The pruned subset after training with the training dataset has a training accuracy that reaches the accuracy threshold.
In another embodiment, for the pruning process, the method further comprises resetting the weights of the connections in the qualified subset to the initial values, optimizing the weights of the connections in the qualified subset based on the training dataset, and removing a portion of the connections from the qualified subset based on a prune percentage.
In another embodiment, the method further comprises performing multiple iterations of the pruning process.
In another embodiment, the initial subset contains an initial percentage of the connections in the DNN, and the qualified subset contains a target percentage of the connections of the DNN. For the growth process, the method uses a binary search to identify the target percentage for the qualified subset.
In another embodiment, for the binary search, the method further comprises identifying the initial percentage as a lower bound of a search interval; selecting, as an upper bound of the search interval, an upper bound percentage of the connections that reaches the accuracy threshold; identifying an intermediate percentage of the search interval; adding additional connections to the initial subset based on the intermediate percentage to form a candidate subset of the connections; resetting the weights of the connections in the candidate subset to the initial values; optimizing the weights of the connections in the candidate subset based on the training dataset; determining the training accuracy of the candidate subset; narrowing the search interval to an upper half of the search interval when the training accuracy is below the accuracy threshold; and narrowing the search interval to a lower half of the search interval when the training accuracy reaches the accuracy threshold.
In another embodiment, for the growth process, the method further comprises selecting a growth percentage, adding additional connections to the initial subset based on the growth percentage to form a candidate subset of the connections, resetting the weights of the connections in the candidate subset to the initial values, optimizing the weights of the connections in the candidate subset based on the training dataset, determining the training accuracy of the candidate subset, identifying the candidate subset as the qualified subset when the training accuracy reaches the accuracy threshold, and initiating another iteration of the growth process when the training accuracy is below the accuracy threshold.
In another embodiment, the initial subset contains an initial percentage of the connections in the DNN, and the method further comprises selecting the initial percentage based on a ratio of a size of the training dataset to a total number of the weights in the DNN.
Another embodiment comprises a system for processing a DNN comprising a plurality of nodes with connections between the nodes. The system comprises a means for initializing the DNN by assigning initial values to weights associated with the connections. The system further comprises a means for identifying an initial subset of the connections randomly selected from the DNN. The initial subset after training with a training dataset has a training accuracy below an accuracy threshold. The system further comprises a means for performing a growth process by adding additional connections from the DNN to the initial subset to generate a qualified subset of the connections. The qualified subset of the connections after training with the training dataset has a training accuracy that reaches the accuracy threshold. The system may further comprise a means for performing a pruning process to remove a portion of the connections from the qualified subset to generate a pruned subset of the connections. The pruned subset after training with the training dataset has a training accuracy that reaches the accuracy threshold.
Other embodiments may include computer readable media, other systems, or other methods as described below.
The above summary provides a basic understanding of some aspects of the specification. This summary is not an extensive overview of the specification. It is intended to neither identify key or critical elements of the specification nor delineate any scope of the particular embodiments of the specification, or any scope of the claims. Its sole purpose is to present some concepts of the specification in a simplified form as a prelude to the more detailed description that is presented later.
Some embodiments of the invention are now described, by way of example only, and with reference to the accompanying drawings. The same reference number represents the same element or the same type of element on all drawings.
The figures and the following description illustrate specific exemplary embodiments. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the embodiments and are included within the scope of the embodiments. Furthermore, any examples described herein are intended to aid in understanding the principles of the embodiments, and are to be construed as being without limitation to such specifically recited examples and conditions. As a result, the inventive concept(s) is not limited to the specific embodiments or examples described below, but is limited only by the claims and their equivalents.
Nodes 110 from adjacent layers have connections 112 or edges between them. DNN 100 may be fully-connected, which means every node 110 in one layer is connected to every node 110 in the previous, adjacent layer. The connection 112 between two nodes 110 of successive layers has an associated weight that defines the influence or significance of the input received over that connection 112.
Node 110 also includes a net input function 204 and an activation function 206. Net input function 204 combines the input received over a connection 112 with the weight associated with the connection 112. For example, net input function 204 multiplies input0 received over connection 112-0 with weight 202-0 associated with connection 112-0. The weight 202 associated with a connection 112 therefore amplifies or dampens the input received over the connection 112. Net input function 204 then sums the input-weight products, and transfers the sum to activation function 206. Activation function 206 mimics a switch in determining whether and to what extent the signal should progress further through DNN 100 to affect the final output 142. For example, activation function 206 may compare the sum from net input function 204 to a threshold. When the sum is greater than or equal to the threshold, activation function 206 may “activate” and transfer the sum as output 208 over connections 112 to successive nodes 110. When the sum is less than the threshold, activation function 206 does not “activate” and does not transfer output 208 to the successive nodes 110. The configuration of node 110 as illustrated in FIG. 2 is provided as one example, and other node configurations are considered herein.
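For illustration only, the node computation described above may be sketched in code as follows, assuming a simple threshold activation; the function name and example values are hypothetical and not part of the disclosure:

```python
import numpy as np

def node_forward(inputs, weights, threshold=0.0):
    """Illustrative node 110: a net input function followed by a
    threshold-style ("switch") activation function."""
    # Net input function 204: multiply each input by the weight of its
    # connection and sum the input-weight products.
    net = float(np.dot(inputs, weights))
    # Activation function 206: only "activate" (pass the sum on as output)
    # when the sum reaches the threshold.
    return net if net >= threshold else 0.0

# Example: a node with three incoming connections and associated weights.
print(node_forward(np.array([0.5, 1.0, 0.2]), np.array([0.8, -0.3, 1.5])))
```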
Training a DNN 100 can be a long and processing-intensive process, as the DNN 100 may include thousands or millions of parameters (i.e., weights). For example, a DNN 100 may be initialized by randomly assigning initial values to the weights 202 of the connections 112. After initialization, the DNN 100 may be trained by processing samples of a training dataset to update the weights 202 of the connections 112. For example, a DNN 100 reviewing a sample in the form of an image may attempt to assign a classification to the image (e.g., “animal,” “car,” “boat,” etc.), or a DNN 100 reviewing a sample in the form of a sound may attempt to assign a classification to the sound (e.g., “voice,” “music,” “nature,” etc.). As an example of training, the DNN 100 may process a labeled sample, compare its output 142 to the label for the sample, and adjust the weights 202 of the connections 112 to reduce the error. This is repeated over the samples of the training dataset until the DNN 100 reaches a desired accuracy.
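As a hedged illustration of this training loop, the following sketch trains a single toy node (not a full DNN 100) on synthetic labeled data: the weights are randomly initialized and then updated sample by sample until a desired training accuracy is reached. The model, data, and names are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled dataset: 2-D samples, label 1 when the coordinates sum to more than 1.
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

# Initialization: assign random initial values to the weights.
w = rng.normal(scale=0.1, size=2)
b = 0.0

def training_accuracy(w, b):
    """Fraction of training samples classified correctly."""
    return float((((X @ w + b) > 0).astype(float) == y).mean())

# Training: process the samples and update the weights until the desired
# training accuracy (here 90%) is reached.
desired_accuracy = 0.90
for epoch in range(100):
    for xi, yi in zip(X, y):
        prediction = 1.0 if xi @ w + b > 0 else 0.0
        w += 0.1 * (yi - prediction) * xi   # update the connection weights
        b += 0.1 * (yi - prediction)
    if training_accuracy(w, b) >= desired_accuracy:
        break

print(f"epochs: {epoch + 1}, training accuracy: {training_accuracy(w, b):.2f}")
```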
When the number of parameters in the DNN 100 is large, the training process as described above can be long and processing-intensive. To overcome this and other problems, smaller subnetworks may be discovered in a DNN 100 that, when trained, reach a desired accuracy. A subnetwork as described herein is a combination of connections 112 and associated weights 202 from a DNN 100. The connections 112 in a subnetwork may be considered unmasked, while the remaining connections 112 of DNN 100 may be considered masked. A trainable subnetwork is a combination of connections 112 and associated weights 202 that are capable of learning with a desired accuracy.
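The distinction between masked and unmasked connections 112 may be pictured as a binary mask over a layer's weights 202, as in the following illustrative sketch; the representation and names are assumptions for illustration, not the disclosed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights 202 of one fully-connected layer (e.g., 100 x 50 connections 112).
initial_weights = rng.normal(size=(100, 50))

def random_mask(shape, keep_fraction, rng):
    """Mark roughly keep_fraction of the connections as unmasked (1);
    the remaining connections are masked (0)."""
    return (rng.random(shape) < keep_fraction).astype(float)

# A very sparse subnetwork: only ~2% of the connections are unmasked.
mask = random_mask(initial_weights.shape, keep_fraction=0.02, rng=rng)
subnetwork_weights = initial_weights * mask   # masked connections carry no signal

print(f"unmasked connections: {int(mask.sum())} of {mask.size}")
```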
An untrainable subnetwork 302 is first identified or defined in DNN 100 that is so sparse that, when trained, it does not reach the desired accuracy. A growth process is then performed to build up the untrainable subnetwork 302 by adding additional connections 112 from DNN 100 to the untrainable subnetwork 302 until a grown subnetwork 304 is discovered. Grown subnetwork 304 is considered a trainable subnetwork that, when trained, reaches the desired accuracy. Grown subnetwork 304 is more dense (i.e., has a larger number of connections 112) than untrainable subnetwork 302, but is more sparse than the DNN 100. For example, grown subnetwork 304 may comprise 6%, 12%, 15%, 20%, etc., of the connections 112 and associated weights 202 from DNN 100.
After discovering grown subnetwork 304, a pruning process may be performed on grown subnetwork 304 to remove the less influential connections 112 from grown subnetwork 304. For example, the connections 112 of grown subnetwork 304 having the smallest-magnitude weights 202 may be removed or masked, and the remaining, unpruned connections 112 form a pruned subnetwork 306. Pruned subnetwork 306 is also considered a trainable subnetwork that, when trained, reaches the desired accuracy.
One or more of the subsystems of subnetwork discovery system 400 may be implemented on a hardware platform comprised of analog and/or digital circuitry. One or more of the subsystems of subnetwork discovery system 400 may be implemented on one or more processors 420 that execute instructions 424 (i.e., computer program code) stored in memory 422. Processor 420 comprises an integrated hardware circuit configured to execute instructions 424, and memory 422 is a computer readable storage medium for data, instructions 424, applications, etc., and is accessible by processor 420. Subnetwork discovery system 400 may include additional components that are not shown for the sake of brevity, such as a network interface, a user interface, internal buses, etc.
For this embodiment, it is assumed that method 500 operates on DNN 100 as shown in FIG. 1.
To discover a trainable subnetwork, method 500 first identifies or defines an untrainable subnetwork 302 that is so sparse that it is not trainable. To do so, growth subsystem 404 identifies an initial subset of connections 112 randomly selected from the DNN 100 (step 504 of FIG. 5). The initial subset 610 contains an initial percentage 611 of the total number of connections 112 in DNN 100.
The initial percentage 611 may be selected with an expectation that the training accuracy of the initial subset 610 is below the accuracy threshold, such as with experience or empirical data. For example, an observation may be that a minimal percentage (e.g., 1%, 2%, etc.) of connections 112 is expected to have a training accuracy below the accuracy threshold. Subnetwork discovery system 400 may also verify that the training accuracy of the initial subset 610 is below the accuracy threshold. To do so, training subsystem 406 may optimize the weights 202 of the connections 112 in the initial subset 610 with an optimization algorithm (e.g., stochastic gradient descent) using the training dataset 410, and determine a training accuracy of the initial subset 610. If the training accuracy of the initial subset 610 is below the accuracy threshold, then the initial subset 610 defines an untrainable subnetwork 302.
Growth subsystem 404 may select the initial percentage 611 in a variety of ways based on constraints (optional step 520). In one embodiment, the initial percentage 611 may be selected based on the size of the DNN 100, such as the number of connections 112 or weights 202. When the size of the DNN 100 is larger, the initial percentage 611 selected may be smaller, and when the size of the DNN 100 is smaller, the initial percentage 611 selected may be larger. In another embodiment, the initial percentage 611 may be selected based on the size of the training dataset 410. The size of the training dataset 410 comprises the number of samples or instances of training data in the training dataset 410. For example, assume that DNN 100 is used for image recognition. The training dataset 410 may include hundreds or thousands of instances of training images. When the size of the training dataset 410 is larger, the initial percentage 611 selected may be larger, and when the size of the training dataset 410 is smaller, the initial percentage 611 selected may be smaller. In another embodiment, the initial percentage 611 may be selected based on a ratio of the size of the training dataset 410 to a total number of weights 202 in the DNN 100. In another embodiment, the initial percentage 611 may be selected from an acceptable range, such as a range of 2-10% of the total number of connections 112 in DNN 100.
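As one hypothetical example of optional step 520, the following sketch assumes the initial percentage 611 scales with the ratio of the training-dataset size to the total weight count and is clamped to an acceptable range; the constants are illustrative assumptions only:

```python
def select_initial_percentage(num_training_samples, total_num_weights,
                              scale=1.0, low=0.02, high=0.10):
    """Illustrative heuristic for the initial percentage 611: start from the
    ratio of training-dataset size to total weight count, then clamp the
    result to an acceptable range (here 2-10%)."""
    ratio = scale * num_training_samples / total_num_weights
    return min(max(ratio, low), high)

# Example: 250,000 training samples and a DNN 100 with 5,000,000 weights -> 5%.
print(select_initial_percentage(250_000, 5_000_000))
```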
With the initial subset 610 identified, growth subsystem 404 performs a growth process to generate a qualified subset of connections 112 built up from the initial subset 610 (step 506). A qualified subset contains a percentage or fraction of the total number of connections 112 in DNN 100 that is greater than the initial percentage 611 used to form the initial subset 610. For the growth process, growth subsystem 404 adds additional connections 112 from DNN 100 to the initial subset 610 to build up from the baseline number of connections 112 that were initially selected. To generate the qualified subset 620 as shown in FIG. 6, growth subsystem 404 may randomly select additional connections 112 from DNN 100 and add them to the initial subset 610 until the qualified subset 620 contains a target percentage 621 of the connections 112 that, after training, reaches the accuracy threshold.
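For illustration, the build-up step may be sketched as randomly unmasking additional connections 112 until a candidate mask contains a target fraction of the connections; the helper name and sizes are assumptions, not the disclosed implementation:

```python
import numpy as np

def grow_mask(mask, target_fraction, rng):
    """Randomly unmask additional connections 112 until the mask contains
    (approximately) target_fraction of all connections."""
    num_needed = int(round(target_fraction * mask.size)) - int(mask.sum())
    grown = mask.copy()
    if num_needed <= 0:
        return grown
    masked_indices = np.flatnonzero(grown == 0)                  # connections still masked
    chosen = rng.choice(masked_indices, size=num_needed, replace=False)
    grown.flat[chosen] = 1.0                                     # unmask the chosen connections
    return grown

rng = np.random.default_rng(0)
initial_mask = (rng.random((100, 50)) < 0.02).astype(float)      # initial subset 610 (~2%)
candidate_mask = grow_mask(initial_mask, target_fraction=0.10, rng=rng)
print(int(initial_mask.sum()), "->", int(candidate_mask.sum()), "unmasked connections")
```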
The growth process from the initial subset 610 to the qualified subset 620 may vary as desired. In general, the growth process adds connections 112 to the initial subset 610 to build a qualified subset 620 that reaches the accuracy threshold. In one embodiment, subnetwork discovery system 400 may use a binary search, a golden ratio search, or the like for the growth process, as is described in more detail below. In another embodiment, subnetwork discovery system 400 may iteratively add a number or percentage of additional connections 112 to the initial subset 610 for the growth process, as is described in more detail below. However, other types of growth processes are considered herein.
Discovery of the qualified subset 620 of connections 112 is beneficial in that the qualified subset 620 represents a trainable subnetwork that has an acceptable training accuracy (i.e., equal to or greater than the accuracy threshold), but has a reduced parameter count (i.e., fewer weights) compared to DNN 100.
For method 500, subnetwork discovery system 400 may further reduce the parameter count of the qualified subset 620 by pruning the qualified subset 620. There may be connections 112 in qualified subset 620 that are less important and can be removed with minimal effect on accuracy. Thus, pruning subsystem 408 may perform a pruning process to generate a pruned subset of connections 112 from the qualified subset 620 (optional step 508). A pruned subset contains a percentage or fraction of the total number of connections 112 in DNN 100 that is less than the target percentage 621 used to form the qualified subset 620. For the pruning process, pruning subsystem 408 removes or masks (i.e., eliminates) a portion or number of the connections 112 from the qualified subset 620 that are determined to be less influential, to form the pruned subset 630 as shown in FIG. 6.
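One plausible sketch of the pruning step is shown below: the smallest-magnitude unmasked weights are masked according to a prune percentage. Only the mask update is shown; the resetting and re-optimization of the weights described above are assumed to occur outside this helper, and all names are illustrative:

```python
import numpy as np

def prune_mask(mask, trained_weights, prune_percentage):
    """Mask the prune_percentage of currently unmasked connections 112 whose
    optimized weights 202 have the smallest magnitude."""
    pruned = mask.copy()
    unmasked = np.flatnonzero(pruned)                    # flat indices of kept connections
    num_to_prune = int(round(prune_percentage * unmasked.size))
    if num_to_prune == 0:
        return pruned
    magnitudes = np.abs(trained_weights).ravel()[unmasked]
    smallest = unmasked[np.argsort(magnitudes)[:num_to_prune]]
    pruned.flat[smallest] = 0.0                          # remove less influential connections
    return pruned

rng = np.random.default_rng(0)
qualified_mask = (rng.random((100, 50)) < 0.10).astype(float)   # qualified subset 620 (~10%)
optimized_weights = rng.normal(size=(100, 50))                  # stand-in for trained weights 202
pruned_mask = prune_mask(qualified_mask, optimized_weights, prune_percentage=0.20)
print(int(qualified_mask.sum()), "->", int(pruned_mask.sum()), "unmasked connections")
```

In an iterative pruning process, the weights would be reset to the initial values 602 and re-optimized on training dataset 410 between successive calls to a helper of this kind.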
Discovery of the pruned subset 630 is beneficial in that the pruned subset 630 represents a trainable subnetwork that has an acceptable training accuracy (i.e., equal to or greater than the accuracy threshold), but has an even further reduced parameter count (i.e., fewer weights) compared to the qualified subset 620.
The initial subset 610 selected above has a training accuracy below the accuracy threshold. Thus, growth subsystem 404 identifies the initial percentage 611 of the initial subset 610 as a lower bound of the search interval for the binary search (step 702). Growth subsystem 404 then establishes or selects an upper bound of the search interval for the binary search (step 704). The upper bound comprises a percentage that is larger than the initial percentage 611, which is referred to generally as an upper bound percentage. The upper bound percentage is selected to have a training accuracy that reaches the accuracy threshold. For example, the upper bound percentage may be 100% of the connections 112, 90% of the connections 112, 50% of the connections 112, or some other percentage that reaches the accuracy threshold. Thus, the search interval for the binary search is initially between the lower bound defined by the initial percentage, and the upper bound defined by the upper bound percentage.
Growth subsystem 404 then searches for a target percentage 621 between the upper and lower bounds. To do so, subnetwork discovery system 400 selects, determines, or identifies an intermediate percentage for the search interval (step 706). The intermediate percentage comprises a midpoint of the search interval. Initially, the intermediate percentage is between the lower bound defined by the initial percentage 611, and the upper bound defined by the upper bound percentage. Growth subsystem 404 then adds additional connections 112 to the initial subset 610 based on the intermediate percentage to form a candidate subset of connections 112 (step 708). Growth subsystem 404 randomly selects additional connections 112 from DNN 100 and adds the additional connections 112 to the initial subset 610 to form the candidate subset that contains the intermediate percentage of connections 112 and associated weights 202.
Training subsystem 406 resets the weights 202 of the connections 112 in the candidate subset to initial values 602 (step 710). Training subsystem 406 then optimizes or adjusts the weights 202 of the connections 112 in the candidate subset based on a training dataset 410 (step 712), such as with an optimization algorithm (e.g., stochastic gradient descent). After training the candidate subset based on a training dataset 410, growth subsystem 404 determines the training accuracy of the candidate subset (step 714). When the training accuracy is below the accuracy threshold (e.g., 90%), growth subsystem 404 narrows the search interval to an upper half of the search interval (step 716). When the training accuracy meets (i.e., is equal to or greater than) the accuracy threshold, growth subsystem 404 narrows the search interval to a lower half of the search interval (step 718).
When the search interval is above a threshold, subnetwork discovery system 400 repeats steps 706-718. When the search interval becomes sufficiently small and is below a threshold (e.g., 1%, 2%, etc.), subnetwork discovery system 400 has converged on a target percentage 621 of connections 112 from DNN 100 for the qualified subset 620 (see also FIG. 6).
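The binary search of steps 702-718 may be sketched as follows. Because training each candidate subnetwork is beyond the scope of a short example, train_and_evaluate below is a stand-in surrogate; in the process described above it would reset the candidate weights 202 to the initial values 602, optimize them on training dataset 410, and return the training accuracy:

```python
def binary_search_target_percentage(lower, upper, train_and_evaluate,
                                    accuracy_threshold=0.90, interval_threshold=0.01):
    """Narrow the search interval of connection percentages until it is
    smaller than interval_threshold, then return the converged target
    percentage 621."""
    while (upper - lower) > interval_threshold:
        intermediate = (lower + upper) / 2.0         # midpoint of the search interval
        accuracy = train_and_evaluate(intermediate)  # train the candidate subset, measure accuracy
        if accuracy < accuracy_threshold:
            lower = intermediate                     # narrow to the upper half of the interval
        else:
            upper = intermediate                     # narrow to the lower half of the interval
    return upper                                     # smallest percentage known to qualify

# Stand-in surrogate for illustration only: pretend that training accuracy
# rises smoothly with the fraction of unmasked connections.
def fake_train_and_evaluate(fraction):
    return min(1.0, 0.5 + 4.0 * fraction)

target = binary_search_target_percentage(0.02, 1.0, fake_train_and_evaluate)
print(f"target percentage: {target:.3f}")
```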
For each iteration of process 800, growth subsystem 404 selects a growth percentage and adds additional connections 112 from DNN 100 to form a candidate subset of connections 112 based on the growth percentage (steps 802 and 804). Training subsystem 406 resets the weights 202 of the connections 112 in the candidate subset to the initial values 602 (step 806). Training subsystem 406 then optimizes or adjusts the weights 202 of the connections 112 in the candidate subset based on training dataset 410 (step 808), such as with an optimization algorithm (e.g., stochastic gradient descent). After training the candidate subset based on the training dataset 410, growth subsystem 404 determines the training accuracy of the candidate subset (step 810). When the training accuracy reaches the accuracy threshold, growth subsystem 404 identifies the candidate subset as a qualified subset 620 (step 812). When the training accuracy is below the accuracy threshold, growth subsystem 404 initiates another iteration where process 800 returns to step 802. Growth subsystem 404 may perform multiple iterations of process 800 to grow the initial subset 610 until it contains a sufficient number of connections 112 to form a qualified subset 620 that, after training, has a training accuracy that reaches the accuracy threshold.
In one embodiment, growth subsystem 404 may use the same growth percentage for each iteration of process 800. In another embodiment, growth subsystem 404 may increase the growth percentage for successive iterations of process 800. For example, growth subsystem 404 may double the growth percentage for successive iterations of process 800. In another embodiment, growth subsystem 404 may decrease the growth percentage for successive iterations of process 800. For example, growth subsystem 404 may divide the growth percentage in half for successive iterations of process 800.
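For illustration, the iterative growth of process 800 with a growth-percentage schedule may be sketched as follows, again using a stand-in surrogate in place of actual training; the names and constants are assumptions only:

```python
def grow_until_qualified(initial_fraction, train_and_evaluate,
                         accuracy_threshold=0.90, growth_percentage=0.02,
                         double_each_iteration=True, max_iterations=50):
    """Iteratively add a growth percentage of connections until a candidate
    subset reaches the accuracy threshold (the qualified subset 620)."""
    fraction = initial_fraction
    for _ in range(max_iterations):
        fraction = min(1.0, fraction + growth_percentage)    # add additional connections 112
        if train_and_evaluate(fraction) >= accuracy_threshold:
            return fraction                                  # qualified subset found
        if double_each_iteration:
            growth_percentage *= 2.0                         # one possible schedule from above
    return fraction

# Stand-in surrogate, as in the binary-search sketch above.
def fake_train_and_evaluate(fraction):
    return min(1.0, 0.5 + 4.0 * fraction)

print(f"qualified fraction: {grow_until_qualified(0.02, fake_train_and_evaluate):.3f}")
```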
Any of the various elements or modules shown in the figures or described herein may be implemented as hardware, software, firmware, or some combination of these. For example, an element may be implemented as dedicated hardware. Dedicated hardware elements may be referred to as “processors”, “controllers”, or some similar terminology. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, a network processor, application specific integrated circuit (ASIC) or other circuitry, field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), non-volatile storage, logic, or some other physical hardware component or module.
Also, an element may be implemented as instructions executable by a processor or a computer to perform the functions of the element. Some examples of instructions are software, program code, and firmware. The instructions are operational when executed by the processor to direct the processor to perform the functions of the element. The instructions may be stored on storage devices that are readable by the processor. Some examples of the storage devices are digital or solid-state memories, magnetic storage media such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media.
As used in this application, the term “circuitry” may refer to one or more or all of the following:
(a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry);
(b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware; and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus to perform various functions; and
(c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
Although specific embodiments were described herein, the scope of the disclosure is not limited to those specific embodiments. The scope of the disclosure is defined by the following claims and any equivalents thereof.