The recent surge in the performance of machine intelligence systems is not due to the development of revolutionary new algorithms. Indeed, the core algorithms used in machine intelligence applications today stem from a body of work that is now over half a century old. Instead, it has been improvements in the hardware and software that implement machine intelligence algorithms in an efficient manner that have fueled the recent surge. Algorithms that were once too computationally intensive to implement in a useful manner with even the most sophisticated of computers can now be executed with specialized hardware on an individual user's smartphone. The improvements in hardware and software take various forms. For example, graphics processing units traditionally used to process the vectors used to render polygons for computer graphics have been repurposed in an efficient manner to manipulate the data elements used in machine intelligence processes. As another example, certain classes of hardware have been designed from the ground up to implement machine intelligence algorithms by using specialized processing elements such as systolic arrays. Further advances have centered around using collections of transistors and memory elements to mimic, directly in hardware, the behavior of neurons in a traditional artificial neural network (ANN). There is no question that the field of machine intelligence has benefited greatly from these improvements. However, despite the intense interest directed to these approaches, machine intelligence systems still represent one of the most computationally and energy intensive computing applications of the modern age, and present a field that is ripe for further advances.
The reason machine intelligence applications are so resource hungry is that the data structures being operated on are generally very large, and the number of discrete primitive computations that must be executed on each of the data structures is likewise immense. A traditional ANN takes in an input vector, conducts calculations using the input vector and a set of weight vectors, and produces an output vector. Each weight vector in the set of weight vectors is often referred to as a layer of the network, and the output of each layer serves as the input to the next layer. In a traditional network, the layers are fully connected, which requires every element of the input vector to be involved in a calculation with every element of the weight vector. Therefore, the number of calculations involved increases with a power law relationship to the size of each layer. Furthermore, this aspect of machine intelligence algorithms makes them difficult to parallelize because the calculations for each layer depend on the output of the prior layer.
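As a rough illustration of the scaling described above, the following sketch (plain NumPy, with layer sizes chosen arbitrarily for illustration) shows a fully connected layer computed as a matrix-vector product and the quadratic growth of its multiply count.

```python
import numpy as np

def fully_connected_layer(input_vector, weight_matrix):
    # Every element of the input vector participates in a calculation
    # with every element of the weight matrix: rows x cols multiplies.
    return weight_matrix @ input_vector

rng = np.random.default_rng(0)
for size in (128, 256, 512):
    x = rng.standard_normal(size)
    w = rng.standard_normal((size, size))
    y = fully_connected_layer(x, w)
    # The multiply count grows quadratically with the layer size.
    print(f"layer size {size}: {size * size} multiplies")
```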
The problems mentioned in the prior paragraph are further exacerbated by modern ANNs. Modern ANN approaches are often referred to in the industry and literature as “deep learning” approaches. This is often a reference to the large number of layers involved, or the complexity of the relationships between the outputs of one layer and the inputs of the other layers. For example, in a modern deep learning ANN, the outputs of a downstream layer could be fed back to a prior layer which thereby adds a recursive element to the overall computation. Both the increase in layers, and the additional complexity associated with recursive relationships between the layers, increase the computational resources needed to implement a modern ANN.
The edges of directed graph 100 represent calculations that must be conducted to execute the graph. In this example, the graph is broken into two sections—a convolutional section 102 and a fully connected section 103. The convolutional portion can be referred to as a convolutional neural network (CNN). The vertices in the directed graph of CNN 102 form a set of layers which includes layers 106, 107, and 108. The layers each include sets of tensors such as tensors 109, 110, and 111. The vertices in the directed graph of fully connected section 103 also form a set of layers which includes layers 112 and 113. Each edge in directed graph 100 represents a calculation involving the origin vertex of the edge. In CNN 102, the calculations are convolutions between the origin vertex and a filter. Each edge in CNN 102 is associated with a different filter such as F11, Fn1, F12, and Fn2. As illustrated, filter F12 and tensor 109 are subjected to a full convolution to generate one element of tensor 111. Filter F12 is “slid around” tensor 109 until a convolution operation has been conducted between the filter and the origin vertex. In other approaches, filter F12 and a portion of tensor 109 are multiplied to generate one element of tensor 111, and the full convolution is used to generate multiple elements of tensor 111. In fully connected section 103, the calculations are multiplications between a set of weights and the values from the prior layer. In fully connected section 103, each edge is associated with a unique weight value that will be used in the calculation. For example, edge 114 represents a multiplication between weight wn and input value 115. The value of element 116 is the sum of a set of identical operations involving all the elements of layer 112 and a set of weight values that uniquely correspond to the origin vertex of each edge that leads to element 116.
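The following sketch is a simplified, hypothetical rendering of the two kinds of edge calculations described for directed graph 100: a filter slid across an origin tensor to produce a full convolution, and a fully connected weighted sum. The shapes and the stand-in names for tensor 109 and filter F12 are assumptions made only for illustration.

```python
import numpy as np

def convolve_edge(tensor, filt):
    # The filter is "slid around" the origin tensor; each placement
    # produces one multiply-accumulate result (a 2-D "valid" convolution).
    th, tw = tensor.shape
    fh, fw = filt.shape
    out = np.zeros((th - fh + 1, tw - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(tensor[i:i + fh, j:j + fw] * filt)
    return out

def fully_connected_edge(prior_layer_values, weights):
    # Each edge multiplies one weight by one value from the prior layer;
    # the destination element is the sum over all such edges.
    return float(np.dot(prior_layer_values, weights))

rng = np.random.default_rng(1)
t109 = rng.standard_normal((8, 8))      # stand-in for tensor 109
f12 = rng.standard_normal((3, 3))       # stand-in for filter F12
print(convolve_edge(t109, f12).shape)   # contributions toward tensor 111
print(fully_connected_edge(rng.standard_normal(16), rng.standard_normal(16)))
```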
Execution of directed graph 100 involves many calculations. In the illustration, dots are used in the vertical directions to indicate the large degree of repetition involved in the directed graph. Furthermore, directed graph 100 represents a relatively simple ANN, as modern ANNs can include far more layers with far more complex interrelationships between the layers. Although not illustrated by directed graph 100, the outputs of one layer can loop back to be the inputs of a prior layer to form what is often referred to as a recurrent neural network (RNN). The high degree of flexibility afforded to a machine intelligence system by having numerous elements, along with an increase in the number of layers and complexity of their interrelationships, makes it unlikely that machine intelligence systems will decrease in complexity in the future. Therefore, the computational complexity of machine intelligence systems is likely to increase in the future rather than diminish.
A computer-implemented method for executing a directed graph, in which each step is conducted by a processor, is disclosed. The method includes deriving a simplified version of the directed graph and applying a pilot input tensor to the simplified version of the directed graph. The method also includes obtaining a collection of execution data during the application of the pilot input tensor to the simplified version of the directed graph. The method also includes applying a live input tensor to the directed graph and conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data. The method also includes obtaining an output tensor from the conditional execution of the directed graph.
A computer-implemented method for generating an inference from a neural network, in which each step is conducted by a processor, is disclosed. The method includes deriving a simplified version of the neural network. The method also includes applying a first input to the simplified version of the neural network. The method also includes obtaining a collection of execution data during the application of the first input to the simplified version of the neural network. The method also includes applying a second input to the neural network. The method also includes conditioning the computation of the neural network during the application of the second input to the neural network. The method also includes obtaining an inference from the conditional computation of the neural network. The conditional computation of the neural network is conditioned using the execution data. The conditional computation of the neural network is less computationally intensive than a non-conditional computation of the neural network using the second input would have been.
The execution of a directed graph can be made less computationally intensive by executing a simplified version of the graph to obtain information that is used to condition a later execution of the graph. This assertion holds true so long as the additional computational requirements for deriving the simplified version, executing the simplified version, and conditioning the execution of the graph are less than the differential between the computational requirements of the conditional and unconditional executions of the graph. Furthermore, this assertion is only relevant to the extent that the conditional execution produces an actionable result. If the conditional execution produces a result that is wildly inaccurate, the savings in computational complexity are not worthwhile.
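The tradeoff stated above can be summarized in a short sketch; the cost figures are arbitrary illustrative numbers, not measurements.

```python
def conditioning_is_worthwhile(cost_derive, cost_simplified_exec,
                               cost_conditioning, cost_unconditional,
                               cost_conditional):
    # Overhead of the pilot pass must be smaller than the savings it enables.
    overhead = cost_derive + cost_simplified_exec + cost_conditioning
    savings = cost_unconditional - cost_conditional
    return overhead < savings

# Illustrative numbers only (arbitrary units of work).
print(conditioning_is_worthwhile(5, 10, 2, 100, 60))   # True: 17 < 40
print(conditioning_is_worthwhile(5, 10, 2, 100, 95))   # False: 17 > 5
```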
Certain approaches disclosed below allow for the conditional execution of a directed graph to be conducted in an efficient manner while maintaining fidelity to the unconditional execution. Accuracy can be maintained while realizing an increase in efficiency via various approaches. For example, specific approaches to the derivation of the simplified version of the graph, specific methods for obtaining and applying the information used for the conditional execution, and specific methods for conditioning the execution itself allow for a high fidelity result to be produced by an efficient conditional execution. Approaches from each of these classes are described in detail below in turn. Although directed graphs that implement machine intelligence algorithms have been utilized as a common example throughout this disclosure, certain approaches disclosed below are more broadly applicable to any field concerned with the efficient computation of a directed graph.
The flow chart begins with step 201 of deriving a simplified version of the directed graph. The simplified version of the graph can be executed by the processor more efficiently than the directed graph itself. Approaches for deriving the simplified version of the directed graph are described below with reference to
The flow chart continues with steps 202 and 203 in which a pilot input tensor is applied to the simplified version of the directed graph, and a collection of execution data is obtained during the application of the pilot input tensor. These steps are conducted to evaluate the response of the simplified version of the directed graph in order to determine which portions of the graph have less of an impact on the overall execution. The obtained information can then be used at a later time to make the execution of the actual directed graph more efficient. Approaches for obtaining and storing the execution data are described below with reference to
Steps 202 and 203 are illustrated as sequential because the execution data is generally available for storage in memory after the input tensor has been applied and the graph has completed execution. This is because the actual contribution of different portions of the graph to the final output might not be known with certainty until the entire graph has been executed and the output tensor has been obtained. However, depending upon what execution data is obtained, step 203 may be completed prior to the complete execution of the directed graph.
Data flow diagram 210 represents the pilot input tensor X being applied to the simplified version of the directed graph 212 to produce execution data 213. The execution data 213 is represented as a markup of the simplified version of the directed graph wherein highlighted portions are identified as having a near negligible contribution to the output tensor. However, the execution data can take on numerous other forms.
The flow chart continues with steps 204 and 205 in which a live input tensor is applied to the directed graph, in step 205, and the directed graph is conditionally executed using the collection of execution data, in step 204. The flow chart completes in step 206 when an output tensor is obtained from the conditional execution of the directed graph. These steps are conducted to execute the originally desired computation against the original directed graph in a more efficient way through the use of the execution data obtained in step 203. The execution data may provide an estimate of which portions of the directed graph can be computed in a more efficient, but less accurate, fashion without impacting the fidelity of the directed graph execution. As such, it provides information concerning the tradeoff between computing efficiency and accuracy. The output tensor obtained in step 206 will therefore be similar to the output tensor that would have been obtained if directed graph 211 had not been conditionally executed, but will be obtained with fewer computing resources. Approaches for conditioning the execution of the directed graph to obtain the output tensor in response to the application of the live input tensor are described below with reference to
Steps 204 and 205 are illustrated as both stemming from step 203 and leading to step 206 because they can be executed in either order or simultaneously. For example, the execution data can be used to modify the directed graph before the input tensor is applied by changing the values associated with the vertices or edges of the graph. In the example of a machine intelligence system, such an approach could involve rounding or down-sampling the values associated with the weights or filters of the system prior to the application of an input to the system. As another example, the execution data can be used to condition execution of the directed graph by inhibiting specific calculations in real time as they are set to occur.
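As a minimal sketch of the first option mentioned above, and assuming the execution data takes the form of a per-weight priority array, the values associated with low-priority portions of the graph could be rounded before the live input tensor is applied. The threshold and rounding scheme are illustrative assumptions.

```python
import numpy as np

def precondition_weights(weights, priority, low_threshold=0.25):
    # Hypothetical scheme: weights whose recorded priority falls below the
    # threshold are rounded to one decimal place before the live input is
    # applied; higher-priority weights are left untouched.
    conditioned = weights.copy()
    low = priority < low_threshold
    conditioned[low] = np.round(conditioned[low], 1)
    return conditioned

rng = np.random.default_rng(2)
w = rng.standard_normal((4, 4))
p = rng.random((4, 4))            # stand-in for per-weight execution data
w_conditioned = precondition_weights(w, p)
```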
Data flow diagram 210 represents the live input tensor X being applied to directed graph 211 overlain with execution data 213. The execution of the directed graph is illustrated as producing output vector Y. In keeping with the above explanations of the data flow diagram, the execution data 213 could represent portions of the directed graph that have a negligible impact on the output tensor which are therefore inhibited during the conditional execution of directed graph 211 with input tensor X. The live input tensor and pilot input tensor are both identified using the reference character X. This is because benefits arise from having the two tensors be similar. In particular, in the machine intelligence space, many systems are based around a classification problem in which the input is recognized as belonging to a specific class. Therefore, the directed graph may have widely different responses based on the class of the input vector. Generally, the pilot input tensor and live input tensor should be stochastically dependent to assure that actionable information is obtained from the simplified execution of the directed graph.
The methods illustrated by flow chart 200 can be applied to the computation of a neural network. Directed graph 211 could be a neural network and a set of edges of the directed graph could be calculations involving a set of weights for the neural network. If the neural network involved convolutional layers, the set of edges could also include calculations involving the convolution of a set of values with a filter of the neural network. The input tensor, the weights, and the filters could all be four or five dimensional tensors. The filters and weights of the neural network could be altered during training of the neural network. The execution of the directed graph could be conducted during training or during deployment of the neural network for purposes of obtaining inference tensors from the neural network. The inference tensors could be a response of the neural network to the live input tensor. The conditional execution of the directed graph could produce the inference tensor in an efficient manner compared to the non-conditional execution of the directed graph. For example, the conditional computation of the neural network could be less computationally intensive than a non-conditional computation of the neural network using the same input tensor, but the inference tensor could be equivalent to the inference tensor that would have been produced during the non-conditional execution. For example, if accuracy has been maintained, the conditional execution of a directed graph with a neural network used for classification would produce the same class in response to a given input tensor as the unconditional execution of the directed graph using the same input tensor.
The simplified version of the directed graph can be executed more efficiently than the directed graph itself. The simplified version of the directed graph will generally be derived from the original version of the directed graph. The reason for using the directed graph as a starting point is that the purpose of the simplified version of the directed graph is to provide an approximation of the behavior of the original directed graph. The simplified version of the directed graph may be a down-sampled version of the directed graph or a version in which individual values in the directed graph were rounded, replaced by more basic data structures, or otherwise simplified. An example of the replacement of a value with a more basic data structure is the replacement of a high precision floating point data structure with a low precision fixed point data structure. The simplified version of the directed graph could also exhibit more dramatic differences as compared to the original directed graph. For example, the simplified version could have vertices and edges associated with tensors of lower rank or dimensionality than those corresponding with the respective vertices and edges of the original directed graph. Furthermore, the simplified version of the directed graph could have inhibited edges, or vertices and edges that have been entirely removed, as compared to the original directed graph.
In situations in which the simplified version is derived via a down-sampling process, the down-sampling of the directed graph can be conducted in numerous ways. Generally, the deriving of the simplified version of the directed graph in step 201 would involve down-sampling the directed graph by a sampling factor, S. The simplified version of the directed graph would thereby be a down-sampled version of the directed graph. For example, tensors associated with the vertices and edges of the directed graph could be down-sampled by a factor of S or by taking S neighboring elements along any number of dimensions and averaging them. In the specific example of a directed graph implementing an ANN, a one-dimensional layer of weights in the ANN could be down-sampled by grouping the weights into groups of five, with the two nearest neighbors to every fifth weight being pulled into a group, and averaging the values. In this example, the down-sampling rate S=5. The down-sampling could also be conducted in any dimension by any rate. The down-sampling can use basic averaging, a sinc filter approach, or polynomial interpolation. In the specific case of an ANN, the deriving of the simplified version of a neural network could include down-sampling a set of weight values, filter values, or any other value used in the computation of the neural network by a sampling factor using the above referenced approaches.
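A minimal sketch of the S = 5 averaging example follows; it assumes a one-dimensional layer of weights and uses a simple contiguous grouping rather than the nearest-neighbor grouping described above.

```python
import numpy as np

def downsample_weights(weights, s=5):
    # Group a one-dimensional layer of weights into groups of size S and
    # replace each group with its average, as in the S = 5 example above.
    n = len(weights) - len(weights) % s        # drop any ragged tail
    groups = np.asarray(weights[:n]).reshape(-1, s)
    return groups.mean(axis=1)

layer = np.arange(20, dtype=float)             # toy layer of 20 weights
print(downsample_weights(layer, s=5))          # 4 averaged replacement weights
```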
The simplified version of the directed graph could also be simplified in terms of resolution in that the individual elements associated with the edges and vertices of the directed graph could be simplified to more basic values. Generally, the original values associated with the execution of the directed graph could be replaced by replacement values to simplify the execution of the directed graph. The replacement values, along with any data from the original directed graph that was not replaced, would represent the simplified version of the directed graph. The replacement values can exhibit various relationships with the original values. For example, the replacement values can be rounded versions of the original values or similar values represented by more basic data structures than the original values. In a specific example, the original values can undergo a process that involves reducing a number of bits of the original values to obtain the replacement values. In situations in which the original values are represented by floating point data structures, the replacement values can be calculated using a set of exponents from the set of original values. As a specific example, if the original directed graph utilized floating point numbers, the simplified version could involve discarding the mantissas and using only the exponent, or the top N bits of the exponent, to roughly represent each value in the directed graph at runtime. As another example, only the sign of the original value could be used. As another example, only a single bit could be utilized for each quantity. The resulting binarized network could be executed with high efficiency, and careful selection of the cutoff value for the binary sorting of the values could help to maintain fidelity in the performance of the simplified graph execution. These approaches could also be combined in various ways. For example, the approach of rounding off the values could be combined with replacing the value with a different data structure, where the rounding interval was selected specifically to avoid the need for a higher resolution data structure.
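The exponent-only and single-bit replacement schemes described above could be sketched as follows; the cutoff and the exact mapping are assumptions made for illustration.

```python
import numpy as np

def exponent_only(values):
    # Keep only the power-of-two exponent of each value (mantissa discarded),
    # i.e. replace x with sign(x) * 2**floor(log2(|x|)) -- a coarse stand-in
    # for dropping the mantissa bits of a floating point representation.
    mantissa, exponent = np.frexp(values)
    return np.sign(values) * np.exp2(exponent - 1)

def binarize(values, cutoff=0.0):
    # One bit per quantity: values above the cutoff map to +1, the rest to -1.
    return np.where(values > cutoff, 1.0, -1.0)

x = np.array([0.03, -0.7, 2.5, -9.1])
print(exponent_only(x))   # coarse power-of-two replacement values
print(binarize(x))        # single-bit replacement values
```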
With specific reference to an ANN, both the network values and the accumulation values could be replaced to simplify computation of the ANN. In an ANN with convolutional and fully connected layers, the weights and filter values could be rounded off to reduce the number of bits required to represent each value. The deriving of the simplified version of a neural network could include replacing a set of weight values, filter values, or any other value used in the computation of the neural network with a set of replacement values using the above referenced approaches.
Once the simplified version of the directed graph is obtained, a pilot input tensor is applied to the simplified version as described above with reference to step 202. The pilot input tensor and simplified version of the directed graph are used to obtain relevant information regarding how the actual directed graph will respond when a live input tensor is applied to the directed graph. As such, the pilot input tensor can in some cases be identical to the live input tensor. However, the pilot input tensor can also be modified if needed to operate with the simplified version of the directed graph, or to further simplify execution of the simplified version of the directed graph. For example, the pilot input tensor could have a lower rank or dimensionality than the live input tensor if the simplified version of the directed graph was not compatible with the rank or dimensionality of the live input tensor. The pilot input tensor could also be a down-sampled or otherwise simplified version of the live input tensor. For example, the pilot input tensor could be a version of the live input tensor in which the data structures used to store the values of the tensor have been replaced with more simplified structures. This approach could also be combined with one in which the directed graph itself was simplified in a similar manner. For example, if the simplified graph replaced 8-bit floating point values with 4-bit fixed point values, the pilot input tensor could do the same with the values from the live input tensor. In another class of approaches, the pilot input tensor will be a best guess attempt by another machine intelligence system to produce a tensor that will get sorted into the same class as the live input tensor. In general, the pilot input tensor will be stochastically related to the live input tensor so that the simplified directed graph will have a similar reaction to the pilot input tensor as the directed graph would have to the live input tensor.
When the pilot input tensor is applied to the simplified version of the directed graph, execution data is obtained that will be later used to condition the execution of the directed graph. The data is generally obtained during execution of the directed graph, but can be separate and distinct from the actual values that are produced to obtain the output of the directed graph. For example, the execution data can be a set of execution data values such as the outputs of each hidden layer in an ANN. However, the execution data values can also be derived from those values via a comparison or other computation. The execution data values can represent, or can be used to derive, an approximation of the relative importance, to the overall execution of the directed graph, of the computation from which they were generated. For example, the execution data values could each uniquely correspond with a set of vertices in the directed graph, each vertex in the set of vertices could produce a contribution to the inference tensor produced by the directed graph, and each execution data value could be proportional in magnitude to the contribution to the inference tensor of each vertex. The execution data values can correspond to any aspect of the directed graph and can represent the importance of that aspect of the directed graph in any number of ways. In specific approaches, the relative importance will be represented by set levels such as high, medium, or low. However, the relative importance could be represented by a numerical value that is proportional to an impact on the inference tensor of the corresponding aspect of the directed graph. The proportionality may be linear or logarithmic.
The execution data can be utilized to produce a markup of the simplified version of the directed graph which tags the directed graph with different levels of priority such as high, medium, or low. These priority values could then be stored in association with different portions of the directed graph. The different levels of priority can describe how much the various portions of the directed graph contributed to the output tensor. The markup can have fixed gradations or can be a heat map with smooth transitions across the graph to indicate the various levels of priority. The priority values for each edge or vertex can be calculated in real time as the directed graph is executing calculations associated with that edge or vertex. For example, the magnitude of a specific computation can be used as a proxy for the priority of that computation, and the execution data can be saved as soon as the computation has been carried out. However, the values can also be updated continuously as the graph continues to carry out the overall computation. Such approaches are beneficial where downstream calculations effectively negate the perceived impact of upstream calculations. As such, the magnitude of downstream calculations can be fed back to impact the stored execution data from prior computations along the same path through the directed graph. The effect of this feedback can be tailored based on how many layers in the directed graph have passed between the value that is being updated and the newly obtained value.
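A minimal sketch of such a markup follows, assuming the execution data values are per-vertex contribution magnitudes and using arbitrary thresholds for the high, medium, and low gradations.

```python
import numpy as np

def markup_priorities(contributions, high=0.5, low=0.1):
    # Use the magnitude of each vertex's recorded contribution as a proxy for
    # its priority, bucketed into coarse gradations (thresholds are arbitrary).
    mags = np.abs(contributions)
    tags = np.full(mags.shape, "medium", dtype=object)
    tags[mags >= high] = "high"
    tags[mags < low] = "low"
    return tags

contribs = np.array([0.02, 0.4, 0.9, -0.07, -0.6])  # per-vertex execution data
print(markup_priorities(contribs))  # ['low' 'medium' 'high' 'low' 'high']
```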
The execution data can also be used to generate specific instructions for a later execution of the directed graph. For example, in the same way that the execution data can be used to generate a tag to indicate that a specific edge of the directed graph is of “low” priority, the execution data can also be used to generate an instruction to reduce the fidelity of the calculations associated with that edge of the directed graph, or to suppress the calculations associated with that edge of the directed graph. Specific approaches for conditioning the execution of the directed graph are discussed in more detail below. Many of these approaches can be triggered by reading the priority information from a tag, and triggering some form of conditional computation based on that tag. However, approaches in which the execution data is the instruction itself short-circuit this intermediate lookup step by directly generating the instruction for how a portion of the directed graph should be executed at a later time.
The execution data can be stored in association with the portions of the directed graph to which it relates in various ways. For example, a markup could be stored in a distributed set of memory locations, or at a single memory location such that all of the data could be recalled using a single memory address or a contiguous sequence of memory addresses. The data can also be stored as an entirely separate data structure in memory. To use the example of execution data 213, the heat map could be stored separately with priority levels and tags identifying specific portions of the graph. Alternatively, the data or markup can be stored directly within the data structures that represent the directed graph and can be obtained along with the data for the directed graph via a single address call to memory. For example, the execution data could be stored in packet headers where the payload of each packet was the data that represented the directed graph itself. To use the example of a directed graph that implements an ANN, the weights or filters of the ANN could be stored along with a value that represented the impact of that weight or filter on the output tensor in response to the pilot input tensor. In a specific example that is in accordance with this class of approaches, a priority value for a weight tensor and the weight tensor itself could be obtained from a memory location using a single memory address.
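One hedged way to picture the packet-header approach is sketched below, where a hypothetical tagged structure keeps a priority value alongside the weight tensor so that both are returned by a single lookup.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TaggedTensor:
    # A toy stand-in for a packet whose header carries the execution data
    # (here a single priority value) and whose payload is the weight tensor.
    priority: float
    payload: np.ndarray

memory = {0x1000: TaggedTensor(priority=0.8, payload=np.ones((3, 3)))}

packet = memory[0x1000]          # one "address call" returns both
print(packet.priority, packet.payload.shape)
```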
The execution data can be used to condition the execution of the directed graph in numerous ways. In general, the approaches used to simplify the directed graph for purposes of generating the simplified version of the directed graph can also be applied to condition the execution of the directed graph. However, as the conditional execution is being guided by information that has been obtained about the performance of the graph, the degree by which the computations are simplified can be much greater in the case of the conditioned execution than in the case of generating the simplified version. As stated previously, the steps associated with conditional execution in
The execution of the directed graph can be conditioned in numerous ways. Generally, the degree to which the computation is conditioned can be set to vary across the directed graph and can include various gradations that align with the relative priority of that portion of the graph. For example, regions of relatively high priority could be computed just as they would be in the unconditionally executed directed graph, while regions of relatively low priority could be excluded from computation entirely. The various approaches for conditional computation discussed below could be mixed and assigned in various ways to the levels of priority. For example, high, medium, and low priorities could be associated with three entirely separate conditional computation schemes. As another example, the conditional computation scheme could be held constant across the directed graph, but the relative accuracy of the scheme could be modified in accordance with the priorities. For example, a degree of rounding or down-sampling could be set proportional to the priority level with a smooth transition from original value execution, to rounded value execution, to execution conducted independently of the original values. Such approaches could be efficiently applied if the priority value was a smoothly varying numerical value.
The actual conditional execution of the directed graph can be conducted in various ways. The conditioning and the forms of conditional computation are separate concepts. Based on the execution data, the fidelity of various computations in the execution of the directed graph can be selectively decreased to different levels. For example, the conditional computation could involve decreasing the number of bits used to represent the inputs or outputs of a given computation. As another example, the data structure used to represent the inputs or outputs of a given computation could be simplified (e.g., from 8-bit floating point to 4-bit fixed point). As another example, the conditional computation could involve providing a fixed value in place of executing the computation. In one particular example, this value could be stored in a header of a data structure that would have been involved in the computation. As another example, the actual arithmetic portion of the computation could be simplified such that it discarded a certain number of least significant bits (LSBs) from the computation. As another example, the computation could be suppressed altogether without even the need for providing a masked value. In even more specific approaches, replacement values for the output of the computation could be stored downstream in association with later stages of the directed graph.
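A minimal sketch of one possible dispatch scheme follows, combining several of the options above; the mapping from priority level to conditioning scheme is an assumption made for illustration.

```python
import numpy as np

def conditional_edge(x, w, priority, fixed_value=0.0):
    # Dispatch on the priority recorded for this edge (scheme is illustrative).
    if priority == "high":
        return np.dot(x, w)                            # full-fidelity computation
    if priority == "medium":
        return np.dot(np.round(x, 1), np.round(w, 1))  # reduced-precision computation
    return fixed_value                                 # low priority: stored value / suppressed

rng = np.random.default_rng(3)
x, w = rng.standard_normal(8), rng.standard_normal(8)
for p in ("high", "medium", "low"):
    print(p, conditional_edge(x, w, p))
```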
The simplified version of the directed graph can be stored and utilized in combination with various pilot input tensors in order to develop different execution data that depends on a particular live input tensor for which accurate conditional execution is required. The simplified version of the directed graph can also be recalculated if the directed graph is modified, such as by training or some other update. In the specific example of an ANN, the simplified version of the directed graph can be automatically updated after each training session or after the activations have changed by a given delta. As another example, the simplified version of the directed graph can be recalculated every time a monitoring system determines that there is a sufficient break in the training or usage of the directed graph. As another example, the simplified version of the directed graph can be recalculated periodically if a monitoring system determines that it is no longer accurate. Such a monitoring system could be configured to periodically run the same input tensor against a simplified version of the directed graph and the directed graph itself and check the loss of fidelity against a predetermined metric. If the monitoring system detected that the loss of fidelity exceeded this metric, the simplified version of the directed graph could be recalculated.
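A hedged sketch of such a monitoring check follows; the probe functions and the 5% metric are placeholders, not values taken from the disclosure.

```python
import numpy as np

def fidelity_check(run_full, run_simplified, probe_input, metric=0.05):
    # Compare outputs of the two versions on the same probe input and report
    # whether the simplified version should be rederived.
    full_out = run_full(probe_input)
    simple_out = run_simplified(probe_input)
    loss = np.max(np.abs(full_out - simple_out)) / (np.max(np.abs(full_out)) + 1e-12)
    return loss > metric   # True -> recalculate the simplified version

probe = np.ones(4)
needs_update = fidelity_check(lambda x: 2.0 * x, lambda x: 1.8 * x, probe)
print(needs_update)        # True: the placeholder 5% metric is exceeded
```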
In the specific application of an ANN, the conditional computation can be used both in the generation of an inference tensor from the ANN and in training of the ANN. In approaches using back propagation, the updating of the weights during back propagation could be varied based on a known priority of that section of the network. For example, the degree to which weights are updated or modified could be limited by the priority of that portion of the ANN. Weights in highly sensitive and important portions of the neural network could be updated with high precision while weights in low sensitivity portions of the neural network could be kept constant during back propagation.
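A minimal sketch of such a priority-gated update follows, assuming the execution data is available as a per-weight priority array; the scaling rule is illustrative only.

```python
import numpy as np

def priority_gated_update(weights, gradients, priority, learning_rate=0.01):
    # Scale each weight's update by its recorded priority: weights in
    # high-priority regions move freely, low-priority weights stay constant.
    return weights - learning_rate * priority * gradients

rng = np.random.default_rng(4)
w = rng.standard_normal(6)
g = rng.standard_normal(6)
p = np.array([1.0, 1.0, 0.5, 0.5, 0.0, 0.0])   # per-weight priority (assumed)
w_new = priority_gated_update(w, g, p)          # last two weights unchanged
```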
While the specification has been described in detail with respect to specific embodiments of the invention, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily conceive of alterations to, variations of, and equivalents to these embodiments. Any of the method steps discussed above can be conducted by a processor operating with a computer-readable non-transitory medium storing instructions for those method steps. The computer-readable medium may be memory within a personal user device or a network accessible memory. The tensors used to implement the weights, accumulation values, filters, inputs, outputs, etc. of the systems described herein can all be four dimensional or five dimensional tensors. The directed graph and the simplified version of the directed graph described herein could be wholly different structures implemented in memory. However, the simplified version could be built off of the original data structure of the directed graph, and recalling the directed graph for later execution could comprise utilizing pointers to old values of the directed graph that were replaced during simplification. In this manner, overlapping values of the two versions of the graph would not need to take up more than one space in memory. Although examples in the disclosure were generally directed to machine intelligence systems, the same approaches could be applied to any computationally intensive application involving the execution of a directed graph. Although examples in the disclosure were generally directed to ANNs, the same approaches could be utilized to enhance the operation of support vector machines, neuromorphic hardware generally, and any deep learning approach involving a complex set of layers. These and other modifications and variations to the present invention may be practiced by those skilled in the art, without departing from the scope of the present invention, which is more particularly set forth in the appended claims.
This application claims the benefit of U.S. Provisional Patent Application No. 62/483,133, filed Apr. 7, 2017, which is incorporated by reference herein in its entirety for all purposes.