The present invention relates to artificial intelligence (AI), neural networks and machine learning, and in particular to a method, system and computer-readable medium for optimizing computation graphs by partial evaluations.
Modern AI frameworks allow the user to provide data in a so-called NCHW (batch_size, channels, height, width) or channels-first data layout, or in a so-called NHWC (batch_size, height, width, channels) or channels-last data layout. While these data layouts are easy to use and well established in the community, the performance of these AI frameworks is significantly lower than if the data were organized in a way that perfectly fits the implementation and the hardware's memory system. Therefore, highly optimized neural network libraries such as the oneAPI Deep Neural Network (OneDNN) library, the CUDA Deep Neural Network (CUDNN) library and other similar libraries require the data to be converted to an optimized memory layout to ensure peak performance.
An embodiment of the present invention provides a method for optimizing a neural network. The method comprises identifying parameters of a computation graph of the neural network that depend on input data as a computation part and parameters of the computation graph that are independent of the input data as a pre-evaluation part, splitting the computation graph into the pre-evaluation part and the computation part, and generating and applying a wrapper that performs a transparent mapping of data layouts of the pre-evaluation part.
Subject matter of the present disclosure will be described in even greater detail below based on the exemplary figures. All features described and/or illustrated herein can be used alone or combined in different combinations. The features and advantages of various embodiments will become apparent by reading the following detailed description with reference to the attached drawings, which illustrate the following:
Embodiments of the present invention provide a system, method and computer-readable medium for optimizing computation graphs by partial evaluations, by identifying transformations that can be evaluated ahead of the execution of the model and performing a pre-evaluation. This reduces the computation and execution time needed to execute the model. The reduction in computation time also allows additional computations to be performed and/or computational resources to be saved, thereby reducing the computational cost of repetitious computations with a significantly improved computational run-time and without a loss of accuracy. Moreover, various embodiments of the present invention provide for enhanced transparency of parameter configuration within a neural network.
OneDNN for x86 instruction set architectures requires convolution inputs to be in a channel-blocked layout that splits the channel dimension into an inner and an outer part, where the blocking size depends on the vector instructions used, e.g., AVX2: block_size=8, AVX512: block_size=16. This requires reshaping, permuting and sometimes also padding the original data. For recurrent neural network (RNN) layers, OneDNN uses a similar blocked format.
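By way of illustration only, the following minimal PyTorch sketch performs such a conversion; the helper to_blocked and the exact axis order are assumptions for demonstration and do not reproduce the OneDNN API:

```python
import torch
import torch.nn.functional as F

def to_blocked(x: torch.Tensor, block: int = 8) -> torch.Tensor:
    """Convert an NCHW tensor to a channel-blocked layout
    (N, C_outer, H, W, C_inner), zero-padding C to a multiple of `block`."""
    n, c, h, w = x.shape
    pad = (-c) % block                       # channel padding needed
    if pad:
        x = F.pad(x, (0, 0, 0, 0, 0, pad))   # append zero channels
    return (x.reshape(n, (c + pad) // block, block, h, w)
             .permute(0, 1, 3, 4, 2)         # inner channels innermost
             .contiguous())
```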
CUDNN requires convolution inputs to be in the NHWC layout to map these inputs onto its high-performance tensor cores. For RNN layers, CUDNN needs to merge all input parameters (biases, weights) into a single, large, consecutive memory segment.
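For illustration, both requirements can be approximated in PyTorch as follows; the flat concatenation below is a simplified sketch and does not reproduce CUDNN's exact internal parameter ordering:

```python
import torch

x = torch.randn(8, 3, 224, 224)               # NCHW input
x_nhwc = x.permute(0, 2, 3, 1).contiguous()   # convert to NHWC

# Merge all RNN weights and biases into one consecutive memory segment.
rnn = torch.nn.LSTM(input_size=64, hidden_size=128, num_layers=2)
flat = torch.cat([p.detach().flatten() for p in rnn.parameters()])
print(flat.shape)                              # one large 1-D buffer
```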
Long vector platforms such as the SX-AURORA of the company NEC CORP., with vector lengths of 256 or 512 elements, benefit from padding the pixel dimensions in pooling and convolution layers (indicated by the "HW" in NCHW and NHWC) to the size of their vector length. This increases the memory size, but enables removing costly boundary checks during execution.
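A minimal sketch of one such padding scheme (an illustrative policy; the actual SX-AURORA layouts may differ):

```python
import torch
import torch.nn.functional as F

def pad_pixels(x: torch.Tensor, vlen: int = 256) -> torch.Tensor:
    """Flatten the H/W pixel dimensions of an NCHW tensor and zero-pad the
    pixel axis to a multiple of the vector length, so the innermost loop
    always fills whole vector registers and needs no boundary checks."""
    n, c, h, w = x.shape
    flat = x.reshape(n, c, h * w)
    return F.pad(flat, (0, (-(h * w)) % vlen))

print(pad_pixels(torch.randn(1, 16, 13, 13)).shape)  # 169 pixels padded to 256
```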
Also, the performance of matrix multiplications (e.g., general matrix vector multiplication (GEMV), general matrix multiply (GEMM), etc.) is highly dependent on the transposition of the input matrices used. Usually, it is beneficial to vectorize the output channels of the layer.
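For illustration, a weight matrix can be stored pre-transposed once, ahead of execution, so that every subsequent GEMM call avoids a per-call transpose; the shapes here are arbitrary assumptions:

```python
import torch

w = torch.randn(1024, 4096)       # (out_channels, in_channels)
w_t = w.t().contiguous()          # transpose once, ahead of execution

x = torch.randn(32, 4096)         # batch of input rows
y = x @ w_t                       # (32, 1024) GEMM, no per-call transpose
```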
However, AI frameworks typically only support the generic NHWC or NCHW layouts, such that, during execution, the memory layout needs to be converted into the desired layout before the execution of each layer, and then converted back again, which wastes costly computation time and computational resources. Further, this process has to be repeated in every mini-batch and epoch during training, and is therefore executed thousands of times.
Another case in which expensive repetition of computations occurs is the use of generative layers such as Arange, Zeros, Ones, Eye, Constant, or their equivalents. For example, in bidirectional encoder representations from transformers (BERT) networks, if the user does not use all inputs, the unused inputs get automatically initialized with zeros, so that the following embedding layer can be statically evaluated.
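For illustration, a minimal sketch of this static evaluation, assuming a BERT-style token_type_ids input that is always zero:

```python
import torch

embedding = torch.nn.Embedding(num_embeddings=2, embedding_dim=768)

# The unused input is always zeros, so the embedding output is a constant
# that can be evaluated once, ahead of execution.
token_type_ids = torch.zeros(1, 128, dtype=torch.long)
precomputed = embedding(token_type_ids).detach()

# At runtime, the generative layer and the embedding lookup are skipped
# and the stored tensor is reused in every iteration.
```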
Hardware-specialized AI libraries require hardware-specific memory layouts to achieve peak performance. However, due to the increasing number of AI hardware platforms, AI frameworks hide these layouts from the user behind an abstraction, which results in higher execution times because layout transformation functions need to be executed at runtime.
Aspect (1): In an aspect (1), the present invention provides a method for optimizing a neural network. The method includes identifying parameters of a computation graph of the neural network that depend on input data as a computation part, and parameters of the computation graph that are independent of the input data as a pre-evaluation part, splitting the computation graph into the pre-evaluation part and the computation part, and generating and applying a wrapper that performs a transparent mapping of data layouts of the pre-evaluation part.
Aspect (2): In an aspect (2), the present invention provides the method according to the aspect (1), wherein the wrapper computes the transparent mapping between a default artificial intelligence (AI) framework layout and a compute library layout of the neural network, generates code implementing the transparent mapping between the default AI framework layout and the compute library layout, and generates a new neural network from the neural network by injecting the code into an execution of the neural network.
Aspect (3): In an aspect (3), the present invention provides the method according to the aspect (2), wherein the method further comprises executing the new neural network.
Aspect (4): In an aspect (4), the present invention provides the method according to the aspects (2) or (3), wherein the method further comprises exporting, storing or deploying the neural network, and reversing, by the wrapper, the transparent mapping back to the default AI framework layout.
Aspect (5): In an aspect (5), the present invention provides the method according to the aspects (1), (2), (3), or (4), wherein the transparent mapping of data layouts of the pre-evaluation part includes a parameter update.
Aspect (6): In an aspect (6), the present invention provides the method according to the aspects (1), (2), (3), (4), or (5), wherein the method further comprises performing the transparent mapping of data layouts of the pre-evaluation part, executing the neural network, and applying a gradient update to the transparently mapped data layout of the pre-evaluation part.
Aspect (7): In an aspect (7), the present invention provides the method according to the aspects (1), (2), (3), (4), (5), or (6), wherein the method further comprises performing the transparent mapping of data layouts of the pre-evaluation part, receiving a request to export the neural network from a current data layout to a subsequent data layout, and executing the transparent mapping of data layouts of the pre-evaluation part backwards.
Aspect (8): In an aspect (8), the present invention provides the method according to the aspects (1), (2), (3), (4), (5), (6), or (7), wherein the method further comprises performing the transparent mapping of data layouts of the pre-evaluation part and storing an output of the pre-evaluation part in the neural network, wherein the pre-evaluation part comprises a generative layer.
Aspect (9): In an aspect (9), the present invention provides the method according to the aspects (1), (2), (3), (4), (5), (6), (7), or (8), wherein handling the transparent mapping of the data layouts by the wrapper comprises receiving a parameter of the neural network, generating a new neural network with a new parameter, performing the transparent mapping of data layouts of the pre-evaluation part using the parameter of the neural network as an input and the new parameter of the new neural network as an output, and replacing the neural network with the new neural network.
Aspect (10): In an aspect (10), the present invention provides the method according to the aspects (1), (2), (3), (4), (5), (6), (7), (8), or (9), wherein handling the transparent mapping of the data layouts by the wrapper comprises detecting a data layout of the neural network, detecting a data layout of a target device that will deploy the neural network, creating a new neural network with the data layout of the target device, and replacing the neural network with the new neural network.
Aspect (11): In an aspect (11), the present invention provides the method according to the aspect (10), wherein the wrapper detects the data layout of the neural network and detects the data layout of the target device that will deploy the neural network in response to a user execution of the neural network.
Aspect (12): In an aspect (12), the present invention provides the method according to the aspects (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), or (11), wherein the method further comprises detecting a data layout of the neural network, detecting a data layout of a target device that will deploy the neural network, performing the transparent mapping of data layouts of the pre-evaluation part, and replacing the neural network with a neural network that utilizes a data layout of the target device.
Aspect (13): In an aspect (13), the present invention provides the method according to the aspects (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), or (12), wherein the method further comprises removing, by the wrapper, a parameter of the neural network in response to a user input.
Aspect (14): In an aspect (14), the present invention provides a system including one or more hardware processors which, alone or in combination, are configured to provide for execution of the steps of identifying parameters of a computation graph of a neural network that depend on input data as a computation part and parameters of the computation graph that are independent of the input data as a pre-evaluation part, splitting the computation graph into the pre-evaluation part and the computation part, and generating and applying a wrapper that performs a transparent mapping of data layouts of the pre-evaluation part.
Aspect (15): In an aspect (15), the present invention provides a tangible, non-transitory computer-readable medium having instructions thereon which, upon being executed by one or more hardware processors, alone or in combination, provide for execution of the steps of identifying parameters of a computation graph of a neural network that depend on input data as a computation part and parameters of the computation graph that are independent of the input data as a pre-evaluation part, splitting the computation graph into the pre-evaluation part and the computation part, and generating and applying a wrapper that performs a transparent mapping of data layouts of the pre-evaluation part.
Considering the neural network computation graph 1 of
Embodiments of the present invention provide for precomputing the transformation functions 10 to obtain a neural network 12 such as the one shown in
This approach according to embodiments of the present invention can also be applied to the previously mentioned generative layers, such as the “Zeros>Embedding” case. In this case, the two layers are precomputed and the output of the embedding is stored as the pre-evaluated parameters 16 in the optimized neural network.
Embodiments of the present invention also provide for implementing padded memory layouts and merging of parameters. Compute libraries provide functions, e.g., a reorder function 10a, a merge function 10b, and a transpose function 10c, used in implementations to compute the layers, e.g., an RNN layer 6c, a dense/GEMM layer 6d, etc., and AI frameworks can use the compute libraries to perform the computations of the layers. As illustrated by
Referring to the exemplary workflow 54 of
As an example of a memory layout transformation, such as a transformation that would be performed on the parameters of
In this example, the input and output data are arranged as "Batches", "Channels", "Y", "X" and the weights are arranged as "OutChannels", "InChannels", "YKernel", "XKernel". However, in neural networks the pixel sizes are rarely divisible by 8 or 16, which is the single instruction multiple data (SIMD) length. Therefore, Intel splits the channels dimension into "Batches", "OuterChannels", "Y", "X", "InnerChannels", where "InnerChannels" has the same size as the SIMD length. This requires adding padding if the number of channels is not divisible by the SIMD length. With this adjustment, no expensive boundary checks are necessary for the channels dimension. Further, channels are chosen over pixels because there can be one to three pixel dimensions but only one channel dimension, and it is therefore easiest to vectorize just this one dimension.
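As a concrete worked example, reusing the to_blocked sketch from above with AVX2 (SIMD length 8), three input channels are padded to eight and split into one outer and eight inner channels:

```python
x = torch.randn(1, 3, 32, 32)   # (Batches, Channels, Y, X)
y = to_blocked(x, block=8)      # pads C: 3 -> 8, OuterChannels = 8 // 8 = 1
print(y.shape)                  # (Batches, OuterChannels, Y, X, InnerChannels)
                                # = torch.Size([1, 1, 32, 32, 8])
```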
An exemplary training pipeline is as follows:
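A minimal sketch of such a pipeline in PyTorch; the toy model, dataset and hyperparameters are placeholder assumptions:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.Linear(8 * 30 * 30, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
dataset = [(torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,)))] * 8
epochs = 2

for epoch in range(epochs):
    for inputs, targets in dataset:      # epochs * len(dataset) iterations
        optimizer.zero_grad()
        outputs = model(inputs)          # layout transformations happen here
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
```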
There are epochs * len(dataset) iterations of the model. In each of these iterations, the AI frameworks would perform the previously mentioned layout transformations. Embodiments of the present invention advantageously provide code that performs these layout transformations automatically, through the wrapper that is being used, the very first time output=model(input) is executed. This code can be injected into an execution of a neural network, for example, as a preface or preliminary portion of the neural network. Without this wrapper, a manual implementation would look like the following code:
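A hedged sketch of such a manual variant; to_optimized_layout and to_default_layout are hypothetical stand-ins for library-specific conversion calls such as a reorder to a blocked format:

```python
import torch

def to_optimized_layout(t: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in: here a channels-last reorder for 4-D tensors.
    return t.contiguous(memory_format=torch.channels_last) if t.dim() == 4 else t

def to_default_layout(t: torch.Tensor) -> torch.Tensor:
    # Hypothetical inverse: restore the default contiguous layout.
    return t.contiguous() if t.dim() == 4 else t

# The user must convert every parameter by hand before training ...
for _, param in model.named_parameters():
    param.data = to_optimized_layout(param.data)

# ... run the training pipeline shown above ...

# ... and must remember to convert back before exporting the model.
for _, param in model.named_parameters():
    param.data = to_default_layout(param.data)
```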
Embodiments of the present invention provide for at least the following improvements over existing technology:
1. Splitting the execution of a neural network into a partial evaluation and main computation graph by identifying all layers that are not dependent on runtime input data and moving them into the partial evaluation graph.
2. Using a wrapper that hides changes to the neural network from the user to enable transparently reconfiguring the number, shape, padding, data type and data layout of the parameters within the neural network.
3. Significantly reducing execution time because the pre-evaluable layers are executed only once and not within every iteration. Take, for example, a convolution that takes 10 ms with optimal memory layouts and for which converting from the default to the optimal layout requires 2 ms. Accordingly, at each iteration, the convolution takes 12 ms in total. If the conversion is moved, however, into a pre-evaluation step in accordance with embodiments of the present invention, the time for processing this layer is reduced by 17% (2 ms of 12 ms). In a normal neural network setting, there are usually hundreds of these layers, and their operations are run for thousands of iterations during training. Accordingly, using a conservative estimate, 100 (layers) * 10,000 (iterations) * 2 ms for the conversion = 2,000,000 ms, i.e., roughly 33 minutes saved by embodiments of the present invention. Despite this significant reduction in execution time, which also results in savings in computational processing power and computational resources, embodiments of the present invention do not have any negative impact on the accuracy of the process or on peak memory consumption.
In an embodiment, the present invention provides a method comprising the following steps:
1. Analyzing the computation graph and looking for runtime data sources to determine which parts can be partially evaluated and which depend on the input data (a simplified sketch of this analysis follows this list).
2. Splitting the computation graph into the pre-evaluation and computation parts.
3. Generating a wrapper that handles the transparent mapping of data layouts of the networks needed by the different processor(s).
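A simplified sketch of steps 1 and 2, using torch.fx tracing as one possible mechanism (an assumption for illustration; the invention is not limited to this tracing approach):

```python
import torch
import torch.fx

def split_graph(model: torch.nn.Module):
    """Partition traced nodes into a pre-evaluation part (independent of
    runtime input data) and a computation part (dependent on runtime data)."""
    traced = torch.fx.symbolic_trace(model)
    tainted = set()                  # nodes reachable from runtime inputs
    pre_eval, compute = [], []
    for node in traced.graph.nodes:
        if node.op == 'placeholder':                 # runtime data source
            tainted.add(node)
        elif any(a in tainted for a in node.all_input_nodes):
            tainted.add(node)
            compute.append(node)                     # depends on input data
        else:
            pre_eval.append(node)                    # evaluable ahead of time
    return pre_eval, compute
```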
The contents of the following webpages are incorporated by reference herein: <<https://oneapi-src.github.io/oneDNN/dev_guide_reorder.html>> (DNNL layer that performs the conversion); <<https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#tensor-ops-conv-functions-data-filter-formats>> (CUDNN layout requirements); <<https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module>> (PyTorch NN Module API, which only contains "register_X" and no "remove_X" function calls); <<https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnGetRNNWeightParams>> (method to determine address ranges in the unified CUDNN RNN weight space, which combines all weights and biases in a single large memory segment); and <<https://pytorch.org/docs/stable/generated/torch.Tensor.to_mkldnn.html?highlight=mkldnn#torch.Tensor.to_mkldnn>> (allows input data to be converted manually to the MKLDNN/DNNL data format, but does not apply to parameters).
While subject matter of the present disclosure has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. Any statement made herein characterizing the invention is also to be considered illustrative or exemplary and not restrictive as the invention is defined by the claims. It will be understood that changes and modifications may be made, by those of ordinary skill in the art, within the scope of the following claims, which may include any combination of features from different embodiments described above.
The terms used in the claims should be construed to have the broadest reasonable interpretation consistent with the foregoing description. For example, the use of the article “a” or “the” in introducing an element should not be interpreted as being exclusive of a plurality of elements. Likewise, the recitation of “or” should be interpreted as being inclusive, such that the recitation of “A or B” is not exclusive of “A and B,” unless it is clear from the context or the foregoing description that only one of A and B is intended. Further, the recitation of “at least one of A, B and C” should be interpreted as one or more of a group of elements consisting of A, B and C, and should not be interpreted as requiring at least one of each of the listed elements A, B and C, regardless of whether A, B and C are related as categories or otherwise. Moreover, the recitation of “A, B and/or C” or “at least one of A, B or C” should be interpreted as including any singular entity from the listed elements, e.g., A, any subset from the listed elements, e.g., A and B, or the entire list of elements A, B and C.
Priority is claimed to U.S. Provisional Patent Application No. 63/255,972, filed on Oct. 15, 2021, the entire disclosure of which is hereby incorporated by reference herein.
Number | Date | Country
---|---|---
63255972 | Oct 2021 | US