The present application claims priority under 35 U.S.C. § 119(a) to Korean Patent Application No. 10-2022-0060378, filed on May 17, 2022, which is incorporated herein by reference in its entirety.
Various embodiments generally relate to a device for partitioning a neural network and an operation method thereof.
As deep learning technologies for neural networks develop, various accelerator technologies using ASICs, GPUs, or FPGAs are being developed.
As a size of a neural network increases to improve service quality, a processing power of an accelerator for deep learning increases, and accordingly, a size of a semiconductor chip used in the accelerator also increases.
However, there is a limit in increasing the size of the semiconductor chip due to limitations in a circuit area and power consumption.
Accordingly, in order to process one complicated neural network, a technique of partitioning the complicated neural network into a plurality of neural network partitions and processing the neural network partitions in different accelerators is used.
In this case, intermediate data generated during neural network processing is transferred between the accelerators. Because the size of the intermediate data is large, it takes a long time to transfer the intermediate data between the accelerators, and the relatively slow communication between the accelerators may degrade the overall performance.
In order to solve this drawback, a technology of transmitting data via a host system equipped with multiple accelerators or a technology of compressing data by a transmitting accelerator and decompressing compressed data by a receiving accelerator is used.
However, the former technology requires a specialized interface such as NVLink, and the latter technology requires additional software and hardware in the accelerators for data compression and decompression.
In accordance with an embodiment of the present disclosure, a device for partitioning an input neural network may include an interposing circuit configured to determine a partitioning position at which the input neural network is to be partitioned, to interpose a partitioning layer in the input neural network at the partitioning position, and to output an entire neural network that is obtained by interposing the partitioning layer in the input neural network, the input neural network including a plurality of layers; a training circuit configured to train the entire neural network; and a partitioning circuit configured to divide the entire neural network into a plurality of neural network partitions by partitioning the partitioning layer.
In accordance with an embodiment of the present disclosure, a method for partitioning an input neural network may include determining a partitioning position at which the input neural network is to be partitioned, the input neural network including a plurality of layers; interposing a partitioning layer in the input neural network at the partitioning position; outputting an entire neural network that is obtained by interposing the partitioning layer in the input neural network; training the entire neural network; and dividing the entire neural network into a plurality of neural network partitions by partitioning the partitioning layer.
The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate various embodiments, and explain various principles and advantages of those embodiments.
The following detailed description references the accompanying figures in describing illustrative embodiments consistent with this disclosure. The embodiments are provided for illustrative purposes and are not exhaustive. Additional embodiments not explicitly illustrated or described are possible. Further, modifications can be made to presented embodiments within the scope of teachings of the present disclosure. The detailed description is not meant to limit this disclosure. Rather, the scope of the present disclosure is defined in accordance with claims and equivalents thereof. Also, throughout the specification, reference to “an embodiment” or the like is not necessarily to only one embodiment, and different references to any such phrase are not necessarily to the same embodiment(s).
The device 100 includes an interposing circuit 110, a training circuit 120, and a partitioning circuit 130.
The interposing circuit 110 interposes a partitioning layer at a position where the input neural network is to be partitioned and outputs an entire neural network that is generated by interposing the partitioning layer in the input neural network.
In
The neural network includes a plurality of layers. Hereinafter, a set of data corresponding to neurons of each layer may be referred to as a ‘tensor.’
That is, a tensor output from one layer is used as an input for the next layer.
Referring to
In the present embodiment, it is assumed that the second layer L2 of
In this embodiment, the partitioning layer Lp has an autoencoder structure and includes an encoding layer Le and a decoding layer Ld. The partitioning layer Lp further includes an intermediate layer Li between the encoding layer Le and the decoding layer Ld.
The encoding layer Le and the intermediate layer Li correspond to an encoder, and the intermediate layer Li and the decoding layer Ld correspond to a decoder. An autoencoder includes the encoder and the decoder.
The number of neurons included in the intermediate layer Li is smaller than the number of neurons included in the encoding layer Le. That is, an input tensor of the encoder is encoded to output an encoded tensor having a smaller size than the input tensor.
The number of neurons included in the decoding layer Ld is the same as the number of neurons included in the encoding layer Le. Therefore, in the decoder, the encoded tensor is decoded to output a decoded tensor having the same size as the input tensor, i.e., to recover the original input tensor.
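For illustration only, the following PyTorch sketch shows one possible fully connected partitioning layer of the kind described above. The class name, the tensor size of 512, and the encoded size of 64 are assumptions made for the example, not values taken from the disclosure.

```python
import torch
import torch.nn as nn

class PartitioningLayer(nn.Module):
    """A minimal sketch of a fully connected partitioning layer Lp with an
    encoding layer Le, a smaller intermediate layer Li, and a decoding layer Ld."""
    def __init__(self, tensor_size: int = 512, encoded_size: int = 64):
        super().__init__()
        # Encoder: Le -> Li (the intermediate layer Li has fewer neurons than Le)
        self.encoder = nn.Sequential(nn.Linear(tensor_size, encoded_size), nn.ReLU())
        # Decoder: Li -> Ld (Ld has the same number of neurons as Le)
        self.decoder = nn.Sequential(nn.Linear(encoded_size, tensor_size), nn.ReLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        encoded = self.encoder(x)        # smaller tensor to be transferred between accelerators
        decoded = self.decoder(encoded)  # recovers a tensor of the original size
        return decoded
```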
When the input neural network is simply divided by the conventional technique at the second layer L2 without interposing the partitioning layer Lp, a tensor having a size corresponding to the second layer L2 must be transferred between accelerators.
In contrast, when the input neural network is divided with the partitioning layer Lp interposed at the second layer L2 as in the present embodiment, the encoded tensor having a smaller size than the tensor corresponding to the second layer L2 may be transmitted between accelerators.
Since both the encoder and the decoder of the autoencoder are neural networks that can be processed by a conventional accelerator, there is no need to change a hardware structure of the accelerator to apply the present technology.
A structure of the autoencoder itself is well known in the art, and may have a fully connected form, a convolutional form, or any of various other forms.
The autoencoder illustrated in
Accordingly, assuming that a partitioning layer is interposed where spatial information is important, it is not desirable to use the fully connected autoencoder illustrated in
In this case, a two-dimensional convolutional autoencoder can be used as the partitioning layer. Through this, the number of channels can be reduced while maintaining the same feature map size.
In the present embodiment, it is assumed that one layer is included in each of the encoder and the decoder of the autoencoder, but the number of layers included in each of the encoder and the decoder may be increased according to another embodiment.
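As a hedged illustration of the convolutional variant, the sketch below reduces the number of channels while keeping the spatial size of the feature map unchanged, using one layer in each of the encoder and the decoder as assumed above. The use of 1x1 convolutions and the channel counts are assumptions for the example; the disclosure does not mandate a specific kernel size.

```python
import torch.nn as nn

class ConvPartitioningLayer(nn.Module):
    """Sketch of a two-dimensional convolutional partitioning layer: the channel
    count is reduced by the encoder and restored by the decoder, while the
    feature map's height and width are preserved."""
    def __init__(self, channels: int = 256, encoded_channels: int = 32):
        super().__init__()
        self.encoder = nn.Conv2d(channels, encoded_channels, kernel_size=1)
        self.decoder = nn.Conv2d(encoded_channels, channels, kernel_size=1)

    def forward(self, x):
        return self.decoder(self.encoder(x))
```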
The convolutional neural network ResNet includes multiple stages having different channel depths, and the partitioning layer Lp can be placed between any two of the multiple stages. In an embodiment, the partitioning layer Lp is placed at the end of a stage.
Referring to
The fully convolutional neural network UNet includes a plurality of skip connections as shown in
The partitioning layer Lp may be interposed at any position in the input neural network, but an appropriate position may be selected to optimize performance.
Returning to
If the input neural network has not been trained, i.e., the input neural network is not a pre-trained neural network, the training operation is performed on the entire neural network in which the partitioning layer is interposed.
In an embodiment, the training operation is performed on the entire neural network using a supervised training method with training data.
Since the training operation itself for the neural network including the autoencoder is well known in the prior art, a detailed description thereof will be omitted.
On the other hand, if the input neural network has already been trained by a previous training operation, i.e., the input neural network is a pre-trained neural network, an additional training operation should be performed on the partitioning layer.
For example, a neural network, such as ResNet-156, which has already been trained using training data such as ImageNet, can be provided as the input neural network.
Autoencoders tend to generate random values at the beginning of training, and accordingly catastrophic forgetting may occur in the input neural network when the training operation is performed on the entire neural network generated by interposing the partitioning layer in the pre-trained input neural network. As a result, the training results of the input neural network, which were obtained by the previous training operation, may be invalidated, and training efficiency may be decreased.
Accordingly, when the input neural network has been trained by the previous training operation, it is necessary to train the partitioning layer while reflecting the training results of the input neural network as much as possible.
To this end, a two-phase training operation is performed.
The first phase training of the two-phase training operation is performed only on the partitioning layer while maintaining the weights included in the pre-trained input neural network. In an embodiment, the first phase training is performed by adjusting only the weights of the partitioning layer while keeping the weights of the previously trained input neural network fixed.
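The following is a minimal PyTorch sketch of the first phase, provided for illustration only. It assumes the entire neural network is a module named `entire_net` whose attribute `partitioning_layer` is the interposed autoencoder; these names, the optimizer, and the hyperparameters are hypothetical.

```python
import torch

def first_phase_training(entire_net, data_loader, loss_fn, epochs=10, lr=1e-3):
    # Freeze the pre-trained input neural network: only the partitioning
    # layer's weights are adjusted in this phase.
    for p in entire_net.parameters():
        p.requires_grad = False
    for p in entire_net.partitioning_layer.parameters():
        p.requires_grad = True

    optimizer = torch.optim.Adam(entire_net.partitioning_layer.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(entire_net(inputs), targets)
            loss.backward()
            optimizer.step()
```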
In general, a training operation of an autoencoder aims to ensure the generalization capability of a trained neural network, i.e., to avoid the overfitting problem and to operate correctly even for input data not included in the training data.
Therefore, it is important to perform a training operation while avoiding the overfitting problem.
In this embodiment, the training data is augmented by generating variations of the training data using data augmentation techniques known from prior publications such as Luis Perez et al., “The Effectiveness of Data Augmentation in Image Classification using Deep Learning,” arXiv preprint arXiv:1712.04621, 2017.
For example, various transformed images may be generated by applying distortion and changes in color, saturation, contrast, and brightness to the image data.
Through this, the autoencoder can learn various patterns and operate correctly on various input data.
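One possible way to realize such augmentation is sketched below using torchvision transforms; the cited article does not prescribe a specific library, and the particular transforms and magnitudes are assumptions chosen only to mirror the distortion, color, saturation, contrast, and brightness changes mentioned above.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),       # geometric distortion
    transforms.ColorJitter(brightness=0.4,   # brightness change
                           contrast=0.4,     # contrast change
                           saturation=0.4,   # saturation change
                           hue=0.1),         # color change
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```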
The training operation of the autoencoder using the training data and augmented data is well known.
For example, the first phase training may be performed until a value of a loss function converges below a predetermined value.
When the first phase training operation is completed, the second phase training of the two-phase training operation is performed.
In the second phase training operation, fine-tuning is performed by retraining (or adjusting) weights of the entire neural network including the pre-trained input neural network and the partitioning layer.
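A sketch of the second phase is given below under the same assumed module name `entire_net`; the small learning rate is an illustrative choice typical of fine-tuning, not a value taken from the disclosure.

```python
import torch

def second_phase_training(entire_net, data_loader, loss_fn, epochs=5, lr=1e-5):
    # Unfreeze every weight: the pre-trained input neural network and the
    # partitioning layer are retrained (fine-tuned) together.
    for p in entire_net.parameters():
        p.requires_grad = True

    optimizer = torch.optim.Adam(entire_net.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(entire_net(inputs), targets)
            loss.backward()
            optimizer.step()
```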
Through the first phase training, the autoencoder learns to output data that is consistent with the operation of the pre-trained input neural network.
Accordingly, the catastrophic forgetting caused by the autoencoder does not occur in the second phase training operation.
In the second case, during the first phase training of the autoencoder, the accuracy is lower than that of the first case, but after the first phase training, the performance in the second phase training is improved by about 4.1% compared to the first case.
Returning to
For example, the entire neural network illustrated in
As shown in
The second neural network partition receives and processes the encoded tensor output from the first neural network partition.
That is, the encoded tensor is transmitted from the first accelerator 210 to the second accelerator 220.
As described above, since the size of the encoded tensor is smaller than that of the tensor before the encoding, an overhead due to communication between the first and the second accelerators 210 and 220 is reduced compared to the conventional technique where the input neural network is simply split without the encoding using the partitioning layer.
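For illustration only, the sketch below shows how the two neural network partitions could be executed on two devices so that only the smaller encoded tensor crosses the device boundary. The module names are hypothetical, and two GPUs stand in for the two accelerators 210 and 220.

```python
import torch

def run_partitioned(first_partition, second_partition, x,
                    dev0="cuda:0", dev1="cuda:1"):
    first_partition.to(dev0)    # partition ending with the encoder of the partitioning layer
    second_partition.to(dev1)   # partition starting with the decoder of the partitioning layer
    encoded = first_partition(x.to(dev0))  # encoded tensor, smaller than the tensor
                                           # at the partitioning position
    encoded = encoded.to(dev1)             # only the encoded tensor is transmitted
    return second_partition(encoded)
```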
Hereinafter, the effect of the partitioning layer on the accuracy of the input neural network is described.
When the partitioning layer is interposed in the input neural network, the accuracy of the input neural network may be lowered.
The input neural network used in an experiment is a fully connected neural network having two inner layers each containing 512 neurons.
The entire neural network is a neural network in which an autoencoder is placed between the two inner layers of the input neural network.
Both the input neural network and the entire neural network have been trained.
The accuracy of the entire neural network is decreased by 0.05% on average compared to that of the input neural network, indicating that there is no significant difference in accuracy even though the autoencoder is interposed in the input neural network.
In this experiment, the training was performed using the data set MNIST, and a categorical cross-entropy loss function was used.
As shown in the graph of
However, it can be seen that the entire neural network can achieve substantially the same level of loss as the input neural network by increasing the number of epochs of the training.
Through the graphs of
Next, a technique for determining a position for interposing the partitioning layer will be described.
As described above, the partitioning layer may be disposed at any position in the input neural network. However, it is desirable to determine an optimal position to improve processing performance of an accelerator.
Assuming that two neural network partitions are generated by interposing the partitioning layer and they are allocated to two accelerators, a position where execution times of the two accelerators are substantially equal to each other is selected as the optimal position according to an embodiment.
In this case, the execution time includes a computation time in the accelerators and a communication time (or transmission time) required to transmit tensors.
The communication time may include a communication time between the accelerators and/or a communication time between a host system and the accelerators.
The communication time is proportional to the size of a tensor to be transmitted for a given bandwidth, and the computation time can be modeled as a first-order function of the number of floating-point operations to be performed, given the accelerator's throughput in floating-point operations per second (FLOPS).
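A rough sketch of this execution time model, corresponding to Equation 1 below, is given here under the stated assumptions: the communication time is the tensor size divided by the link bandwidth, and the computation time is the operation count divided by the accelerator throughput. The function and parameter names are hypothetical.

```python
def execution_time(flops, tensor_bytes, throughput_flops_per_s, bandwidth_bytes_per_s):
    comp = flops / throughput_flops_per_s        # comp_i: computation time
    comm = tensor_bytes / bandwidth_bytes_per_s  # comm_i: transmission time
    return comp + comm                           # exec_i = comp_i + comm_i
```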
In this embodiment, the following method is used to determine the optimal position to place the partitioning layer.
First, there are provided a plurality of cases respectively corresponding to a plurality of candidate positions where the partitioning layer may be placed. The number of the plurality of cases, that is, the number of partitioning cases, is assumed to be N, which is a natural number greater than 1.
For each of the partitioning cases, it is assumed that each neural network partition is assigned to one accelerator, and the number of neural network partitions, that is, the number of accelerators, is assumed to be k.
In each of the partitioning cases, an execution time exec_i of an i-th accelerator is given by Equation 1. Here, i is one of the natural numbers from 1 to k.
exec_i = comp_i + comm_i [Equation 1]
In Equation 1, comp_i represents a computation time of the i-th accelerator, and comm_i represents a transmission time of a tensor at the i-th accelerator.
As shown in Equation 2, the maximum value among the execution times of the k accelerators is expressed as exec_j.
exec_j = max_{1 ≤ i ≤ k} exec_i [Equation 2]
An evaluation value E_n corresponding to an n-th partitioning case is determined as in Equation 3, where n is one of the natural numbers from 1 to N.
E_n = Σ_{i=1}^{k} (exec_j - exec_i) [Equation 3]
That is, in the present technology, the evaluation value E_n corresponding to the n-th partitioning case is determined by using the sum of the differences between each execution time exec_i and the maximum execution time exec_j.
As the execution times of the k accelerators become more similar to one another, each of the differences from the maximum value exec_j becomes smaller, and the evaluation value E_n of Equation 3 has a smaller value.
In the present embodiment, the partitioning case in which the evaluation value E_n of Equation 3 has the smallest value among the N partitioning cases is selected, and accordingly, the optimal position of the partitioning layer is determined to be the position corresponding to the selected partitioning case.
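The selection of the optimal partitioning case can be sketched as follows; the data layout (a list of per-accelerator execution times for each candidate case) is an assumption made for illustration.

```python
def select_partitioning_case(cases):
    # cases: list of lists; cases[n] = [exec_1, ..., exec_k] for the n-th partitioning case
    def evaluation_value(exec_times):
        exec_j = max(exec_times)                    # Equation 2: maximum execution time
        return sum(exec_j - e for e in exec_times)  # Equation 3: sum of differences
    # return the index n of the case with the smallest evaluation value E_n
    return min(range(len(cases)), key=lambda n: evaluation_value(cases[n]))
```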
When the optimal position to interpose the partitioning layer is determined, additional optimization may be performed in consideration of a data reduction rate R.
The data reduction rate R is given by the ratio of the size T_p of the original tensor before the partitioning to the size T_e of the encoded tensor, as shown in Equation 4.
R = T_p / T_e [Equation 4]
Taking
As the size T_e of the encoded tensor decreases, a size of data transmitted between the accelerators decreases and the data reduction rate R increases accordingly.
The performance is represented by a data reduction rate R. As the data reduction rate R increases, the performance improves, and as the data reduction rate R decreases, the performance deteriorates.
Each experiment was performed on three neural networks: ResNet, UNet, and EfficientNet.
Referring to
Therefore, an appropriate level of the data reduction rate R may be determined in consideration of the trade-off between the accuracy loss and the performance.
Taking the neural network ResNet as an example, when the data reduction rate R is 64, the accuracy loss is only 0.4%, but when the data reduction rate R increases beyond 64, the accuracy loss increases more rapidly. Therefore, the data reduction rate R for optimizing the performance can be determined as 64.
A specific data reduction rate may be easily determined by a person skilled in the art according to embodiments.
In order to derive a result as shown in
In this embodiment, the relationship between the data reduction rate R and the accuracy loss was derived by training only the partitioning layer, that is, the autoencoder, instead of training the entire neural network.
For example, it is assumed that the neural network is divided into two neural network partitions because there is only one partitioning layer.
Among the two neural network partitions, a first neural network partition that provides a signal to the partitioning layer is represented by FL(x) and a second neural network partition that receives an output of the partitioning layer is represented by FR(x′), where x is an input tensor (e.g., an input image) that is input to the entire neural network and x′ corresponds to a tensor generated by the partitioning layer.
Autoencoders generally aim to make a tensor output therefrom similar to a tensor input thereto. Accordingly, in the present embodiment, it can be seen that the encoder of the partitioning layer effectively performs an encoding operation when the data reduction rate R reaches a certain degree while satisfying FL(x) ≈ x′.
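A minimal sketch of training only the partitioning layer toward FL(x) ≈ x′ is given below: the frozen first partition produces the target activations, and the autoencoder is trained to reconstruct them, so the entire neural network need not be retrained. The function and module names, the MSE reconstruction loss, and the hyperparameters are assumptions for the example.

```python
import torch
import torch.nn as nn

def train_autoencoder_only(first_partition, autoencoder, data_loader,
                           epochs=10, lr=1e-3):
    # Freeze the first neural network partition FL; only the autoencoder is trained.
    for p in first_partition.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(autoencoder.parameters(), lr=lr)
    recon_loss = nn.MSELoss()
    for _ in range(epochs):
        for x, _ in data_loader:
            with torch.no_grad():
                target = first_partition(x)     # FL(x)
            x_prime = autoencoder(target)       # x', the tensor generated by the partitioning layer
            loss = recon_loss(x_prime, target)  # enforce FL(x) ≈ x'
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```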
The graph of
The vertical axis of
It can be seen from the graph of
The graphs of
As a result of comparing the results of the two experiments based on the neural network ResNet, it is noted that there was a speed increase of 76.9 times when only the autoencoder was trained.
The graph of
Taking the neural network ResNet as an example, it can be seen that there is a performance improvement and energy saving of about 20% for all data reduction rates.
The neural network EfficientNet shows performance and energy saving results similar to those of the neural network ResNet. In the case of the neural network UNet, it can be seen that the performance improvement and energy saving thereof are further increased compared to those of the neural networks ResNet and EfficientNet.
Although various embodiments have been illustrated and described, various changes and modifications may be made to the described embodiments without departing from the spirit and scope of the invention as defined by the following claims.