The present disclosure relates to an information processing method and so forth executed by a computer.
Non-patent literature (NPL) 1 proposes a method of searching for a network architecture for a neural network.
However, it is difficult for the method proposed in NPL 1 to find an inference model that is expected to reduce a loss caused by quantization.
In view of this, the present disclosure provides an information processing method and so forth capable of finding an inference model that is expected to reduce a loss caused by quantization.
The information processing method according to an aspect of the present disclosure is an information processing method executed by a computer. This information processing method includes: obtaining a first inference model serving as a reference; computing a second inference model that is larger than the first inference model in model size, based on the first inference model; quantizing the second inference model computed to generate a third inference model; training the third inference model, using machine learning; determining whether a performance of the third inference model trained satisfies a condition; and outputting the third inference model trained, when the performance satisfies the condition.
These general and specific aspects may be implemented using a system, a device, a method, an integrated circuit, a computer program, or a non-transitory computer-readable recording medium such as a CD-ROM, or any combination of systems, devices, methods, integrated circuits, computer programs, or recording media.
The information processing method and so forth according to an aspect of the present disclosure make it possible to find an inference model that is expected to reduce a loss caused by quantization.
Functions of deep learning-based inference are integrated into some of the Internet of things (IoT) devices. Also, from the cost and privacy point of view, inference processing is performed not in an environment such as a cloud or a GPU, but by a processor included in a device in some cases. In these cases, the size of a network (NW) is reduced, using a method such as quantization. Through this, deep learning-based inference processing is performed by a processor having limited computation resources such as computing power and memory capacity.
Here, the network refers to an inference model, such as a neural network model, for performing inference processing.
However, in quantization, for example, a reference network (RefNW) that uses floating-point representation is converted into an integrated environment network (IntNW) that uses fixed-point representation. Such quantization can result in a loss in inference performance. More specifically, accuracy can decrease and disagreement can occur between the inference results of the reference network and the integrated environment network.
NPL 1 proposes a method, relating to a neural network model, of searching for a network architecture. The method proposed in NPL 1 searches for a network architecture having high inference performance and high inference speed. The network architecture corresponds to the number of layers, the number of nodes of each layer, and inter-node connections, and so forth. Stated differently, the method proposed in NPL 1 searches for a network architecture that includes the number of layers, the number of nodes of each layer, inter-node connections, and so forth that achieve high inference performance and high inference speed.
However, the method proposed in NPL 1 fails to take into consideration a loss caused by quantization. For this reason, the use of the method proposed in NPL 1 can result in a loss in the inference performance due to quantization that is performed for reducing the network size.
In view of this, the information processing method according to an aspect of the present disclosure is, for example, an information processing method executed by a computer. This information processing method includes: obtaining a first inference model serving as a reference; computing a second inference model that is larger than the first inference model in model size, based on the first inference model; quantizing the second inference model computed to generate a third inference model; training the third inference model, using machine learning; determining whether a performance of the third inference model trained satisfies a condition; and outputting the third inference model trained, when the performance satisfies the condition.
With this, the second inference model that is larger than the first inference model in model size is quantized. The performance of the second inference model, which has a large model size, is assumed to be less prone to degradation even after quantization. Stated differently, the third inference model that is generated by quantizing the second inference model that is larger than the first inference model in model size is assumed to incur a relatively small loss caused by quantization. This makes it possible to find an inference model that is expected to reduce a loss caused by quantization.
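The sequence of steps above can be sketched as a simple loop. The following Python sketch is illustrative only: the function name `find_quantized_model`, the callback parameters, and the `max_rounds` limit are hypothetical and do not appear in the present disclosure. Growing the candidate on every round is an assumption about how a larger second inference model might be computed.

```python
from typing import Callable, Optional, TypeVar

M = TypeVar("M")  # stands in for any inference model representation


def find_quantized_model(
    first_model: M,
    compute_larger: Callable[[M], M],
    quantize: Callable[[M], M],
    train: Callable[[M], M],
    satisfies_condition: Callable[[M], bool],
    max_rounds: int = 10,  # hypothetical safety limit, not part of the disclosure
) -> Optional[M]:
    """Return a trained third model that satisfies the condition, or None."""
    second_model = first_model  # the search starts from the reference model
    for _ in range(max_rounds):
        # Compute a second model larger than the current candidate (assumed step).
        second_model = compute_larger(second_model)
        # Quantize the second model to generate the third model, then train it.
        third_model = train(quantize(second_model))
        # Output the trained third model only when its performance is acceptable.
        if satisfies_condition(third_model):
            return third_model
    return None
```

As a toy usage, a model can be represented by a single integer width: `find_quantized_model(4, lambda w: w * 2, lambda w: w, lambda w: w, lambda w: w >= 16)` doubles the width until the condition holds.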
Also, the information processing method further includes for example: obtaining settings information indicating settings for the quantizing of the second inference model; and setting an initial value for the computing of the second inference model, based on the settings information and the first inference model.
With this, the computing of the second inference model is started on the basis of the settings information for quantization and the first inference model. This makes it possible to find, at an early stage, the third inference model that is based on quantization and the first inference model.
Also, the information processing method further includes for example: obtaining difficulty level information indicating an inference difficulty level of at least one of the first inference model, the second inference model, or the third inference model; and setting an initial value for the computing of the second inference model, based on the difficulty level information and the first inference model.
With this, the computing of the second inference model is started on the basis of the difficulty level information of inference and the first inference model. This makes it possible to find, at an early stage, the third inference model that is based on the inference difficulty level and the first inference model.
Also, for example, the computing of the second inference model is a search for the second inference model performed using a loss function, the loss function is a function whose output value decreases with a decrease in a difference between an inference result of the first inference model and an inference result of the third inference model, and whose output value decreases with an increase in the model size of the second inference model relative to the first inference model, and the search for the second inference model is performed to cause the output value of the loss function to decrease.
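One possible form of such a loss function is sketched below in Python. The mean squared difference, the inverse size-ratio term, and the names `search_loss` and `size_weight` are assumptions chosen for illustration; the present disclosure states only the two monotonic properties, not a specific formula.

```python
from typing import Sequence


def search_loss(
    first_outputs: Sequence[float],
    third_outputs: Sequence[float],
    first_size: int,
    second_size: int,
    size_weight: float = 0.1,  # hypothetical weighting of the size term
) -> float:
    """Loss that falls as the outputs agree and as the second model grows."""
    if len(first_outputs) != len(third_outputs) or not first_outputs:
        raise ValueError("outputs must be non-empty and of equal length")
    # Term 1: decreases as the first and third inference results get closer.
    difference = sum(
        (a - b) ** 2 for a, b in zip(first_outputs, third_outputs)
    ) / len(first_outputs)
    # Term 2: decreases as the second model grows relative to the first model.
    size_term = size_weight * first_size / second_size
    return difference + size_term
```

A search procedure would then prefer candidates for which `search_loss` returns a smaller value.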
This makes it possible to find, on the basis of the loss function, an inference model that is expected to reduce a loss caused by quantization.
Also, the information processing method further includes for example: obtaining settings information indicating settings for the quantizing of the second inference model; and changing the loss function, based on the settings information.
This makes it possible to find an inference model that is expected to reduce a loss caused by quantization, on the basis of the loss function that is based on the settings for quantization.
Also, for example, the loss function is changed to increase the output value of the loss function with an increase in a degree of the quantizing in the settings indicated by the settings information, and the search for the second inference model is performed to cause the output value of the loss function to be less than or equal to a threshold.
With this, although the output value of the loss function increases with an increase in the degree of quantization, the second inference model is searched for to cause the output value of the loss function to be less than or equal to the threshold. Stated differently, even when the loss is large due to a large degree of quantization, the second inference model that satisfies a certain condition for reducing the loss is searched for. This makes it possible to find an inference model that is expected to reduce the loss to a constant level.
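The interaction between the degree of quantization and the threshold can be illustrated with a small Python sketch. The multiplier `full_bits / bits`, the candidate list `widths`, and all function names are hypothetical; they are one simple way to make the loss grow with the degree of quantization and are not taken from the present disclosure.

```python
from typing import Callable, Iterable, Optional


def quantization_penalty(bits: int, full_bits: int = 32) -> float:
    """Multiplier that grows as fewer bits are kept (higher degree of quantization)."""
    return full_bits / bits


def adjusted_loss(base_loss: float, bits: int) -> float:
    """Loss function changed on the basis of the quantization settings."""
    return base_loss * quantization_penalty(bits)


def search_width(
    bits: int,
    threshold: float,
    base_loss_for_width: Callable[[int], float],
    widths: Iterable[int],
) -> Optional[int]:
    """Return the first candidate width whose adjusted loss is within the threshold."""
    for width in widths:
        if adjusted_loss(base_loss_for_width(width), bits) <= threshold:
            return width
    return None
```

With a toy loss of `1.0 / width` and a threshold of 0.1, 8-bit quantization is satisfied by a width of 64, whereas 4-bit quantization requires a width of 128. A higher degree of quantization therefore pushes the search toward a larger second inference model.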
Also, the information processing method further includes for example: obtaining difficulty level information indicating an inference difficulty level of at least one of the first inference model, the second inference model, or the third inference model; and changing the loss function, based on the difficulty level information.
This makes it possible to find an inference model that is expected to reduce a loss caused by quantization, on the basis of the loss function that is based on the inference difficulty level.
Also, for example, the loss function is changed to increase the output value of the loss function with an increase in the inference difficulty level indicated by the difficulty level information, and the search for the second inference model is performed to cause the output value of the loss function to be less than or equal to a threshold.
With this, although the output value of the loss function increases with an increase in the inference difficulty level, the second inference model is searched for to cause the output value of the loss function to be less than or equal to the threshold. Stated differently, even when the loss is large due to a high inference difficulty level, the second inference model that satisfies a certain condition for reducing the loss is searched for. This makes it possible to find an inference model that is expected to reduce the loss to a constant level.
Also, the information processing method further includes, for example, changing settings for the quantizing of the second inference model when the performance fails to satisfy the condition.
With this, the settings for quantization may be changed so that the condition for the performance is satisfied. This makes it possible to find an inference model that is expected to satisfy the condition for the performance.
Also, for example, the condition includes accuracy or correctness of an inference of the third inference model with respect to an inference result of the first inference model or reference data, and the changing of the settings includes decreasing a degree of the quantizing when the accuracy or the correctness of the inference of the third inference model is less than or equal to a threshold.
With this, when the accuracy or the correctness of the inference of the third inference model is less than or equal to the threshold, the degree of quantization to be performed on the second inference model is decreased to improve the accuracy or the correctness of the inference of the third inference model. This makes it possible to find an inference model that is expected to satisfy the condition for the accuracy or the correctness of inference.
Also, for example, the condition includes a speed of inference processing of the third inference model, and the changing of the settings includes increasing a degree of the quantizing when the speed of the inference processing is less than or equal to a threshold.
With this, when the speed of the inference processing of the third inference model is less than or equal to the threshold, the degree of quantization to be performed on the second inference model is increased to increase the speed of the inference processing of the third inference model. This makes it possible to find an inference model that is expected to satisfy the condition for the speed of the inference processing.
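The two adjustments described above (decreasing the degree of quantization when accuracy is insufficient, and increasing it when speed is insufficient) can be sketched as follows. Expressing the degree of quantization as a bit width, the step size, the bit limits, and the priority given to accuracy over speed are all assumptions made for this illustration.

```python
def adjust_quantization_bits(
    bits: int,
    accuracy: float,
    speed: float,
    accuracy_threshold: float,
    speed_threshold: float,
    step: int = 2,      # hypothetical adjustment step
    min_bits: int = 2,  # hypothetical lower limit
    max_bits: int = 32, # hypothetical upper limit
) -> int:
    """Return a new bit width after checking the performance conditions."""
    if accuracy <= accuracy_threshold:
        # Accuracy too low: decrease the degree of quantization (use more bits).
        return min(bits + step, max_bits)
    if speed <= speed_threshold:
        # Inference too slow: increase the degree of quantization (use fewer bits).
        return max(bits - step, min_bits)
    # Both conditions are satisfied, so the settings are left unchanged.
    return bits
```

In this sketch accuracy is checked first, so a model that is both inaccurate and slow receives more bits; a different priority could equally be chosen.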
Also, the information processing method further includes for example: inputting data to the first inference model to obtain an inference result of the first inference model; inputting the data to the second inference model to obtain an inference result of the second inference model; and training the first inference model, based on a difference between the inference result of the first inference model and the inference result of the second inference model.
With this, the first inference model and the third inference model are built on the basis of the same second inference model. This thus reduces the difference between the inference result of the first inference model and the inference result of the third inference model.
Also, the information processing system according to an aspect of the present disclosure includes, for example, at least one processor and at least one memory. Using the at least one memory, the at least one processor: obtains a first inference model serving as a reference; computes a second inference model that is larger than the first inference model in model size, based on the first inference model; quantizes the second inference model computed to generate a third inference model; trains the third inference model, using machine learning; determines whether a performance of the third inference model trained satisfies a condition; and outputs the third inference model trained when the performance satisfies the condition.
With this, the second inference model that is larger than the first inference model in model size is quantized. The performance of the second inference model, which has a large model size, is assumed to be less prone to degradation even after quantization. Stated differently, the third inference model that is generated by quantizing the second inference model that is larger than the first inference model in model size is assumed to incur a relatively small loss caused by quantization. This makes it possible to find an inference model that is expected to reduce a loss caused by quantization.
Further, these general and specific aspects may be implemented using a system, a device, a method, an integrated circuit, a computer program, or a non-transitory computer-readable recording medium such as a CD-ROM, or any combination of systems, devices, methods, integrated circuits, computer programs, or recording media.
Hereinafter, certain exemplary embodiments are described in greater detail with reference to the accompanying Drawings. Each of the exemplary embodiments described below shows a general or specific example. The numerical values, shapes, materials, elements, the arrangement and connection of the elements, steps, the processing order of the steps etc. shown in the following exemplary embodiments are mere examples, and therefore do not limit the scope of the appended Claims and their equivalents. Therefore, among the elements in the following exemplary embodiments, those not recited in any one of the independent claims are described as optional elements.
Also, in the present disclosure, ordinal numbers, such as first, second, and third, are assigned to some of the elements. These ordinal numbers are assigned to the elements to identify the elements, and thus do not necessarily correspond to a meaningful order. These ordinal numbers may be reordered, removed, or newly assigned as appropriate.
Also, in the present disclosure, inference includes detection, recognition, identification, and so forth. Also, in the present disclosure, computing includes determination processing, search processing, obtainment processing, derivation processing, extraction processing, and so forth.
Also, in the present disclosure, to distill network NW1 to train network NW2 means, for example, to use network NW1 as a teacher network to train network NW2. Also, in the present disclosure, to train a network means, for example, to adjust parameters of the network. To train a network may be read as to perform learning on the network. Also, a network may be read as an inference model.
The reference network is assumed to be used, for example, in a cloud environment or a GPU environment, and is built using floating-point representation. Meanwhile, the integrated environment network is assumed to be integrated into, for example, an IoT device, and is built using fixed-point representation. Basically, training is first performed in a cloud environment or a GPU environment to build the reference network. After this, the reference network is converted into the integrated environment network.
The integrated environment has limited resources. As such, when the reference network is converted into the integrated environment network, the size of the network is reduced. Such size reduction includes quantization for converting floating-point representation into fixed-point representation. This size reduction results in a decrease in the detection accuracy.
More specifically, as shown in
As a result, the integrated environment network may fail to achieve a desired performance. In addition, disagreement may occur, for example, between the inference results of the reference network and the integrated environment network, which can result in an increase in the number of steps required to evaluate and verify the integrated environment network.
This can achieve the desired performance in the integrated environment network. This also reduces, for example, disagreement between the inference results of the reference network and the integrated environment network, thus reducing the number of steps required to evaluate and verify the integrated environment network.
Network searcher 101 searches for second network 112 that is expected to have better inference accuracy and inference speed as a result of distilling first network 111 serving as a reference network. This process corresponds to obtaining an evaluation value from evaluation value calculator 103 and searching for second network 112 with which an evaluation value is expected to be better.
First network 111 and second network 112 are inference models, such as neural network models, for performing inference processing. The inference speed of second network 112 is expected to be higher than that of first network 111. For this reason, the size of second network 112 to be searched for is assumed to be smaller than the size of first network 111.
Note that the size of a network corresponds to the number of nodes, the number of layers, the number of parameters, the number of inter-node connections, etc. included in the network. The size of the network increases with an increase in the number of nodes, the number of layers, the number of parameters, the number of inter-node connections, etc. included in the network. Alternatively, the size of the network may correspond to any one of the number of nodes, the number of layers, the number of parameters, and the number of inter-node connections included in the network.
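As a worked illustration of these size measures, the following Python sketch counts them for a fully connected network described only by its layer widths. The restriction to fully connected layers with one bias per non-input node, and the function name `network_size`, are assumptions made for this example.

```python
from typing import Dict, Sequence


def network_size(layer_widths: Sequence[int]) -> Dict[str, int]:
    """Count size measures for a fully connected network (assumed structure)."""
    # Every node in one layer connects to every node in the next layer.
    connections = sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))
    # Parameters: one weight per connection plus one bias per non-input node.
    parameters = connections + sum(layer_widths[1:])
    return {
        "nodes": sum(layer_widths),
        "layers": len(layer_widths),
        "connections": connections,
        "parameters": parameters,
    }
```

For layer widths `[4, 8, 2]`, this gives 14 nodes, 3 layers, 48 inter-node connections, and 58 parameters. Widening the middle layer increases every measure except the number of layers.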
Evaluation value calculator 103 calculates an evaluation value. More specifically, evaluation value calculator 103 obtains, for example, the inference result of first network 111 and the inference result of second network 112, and calculates an output value of the loss function relating to inference accuracy and inference speed as the evaluation value. The output value of the loss function increases with an increase in the difference between the inference result of first network 111 and the inference result of second network 112 and with an increase in the processing latency of second network 112. A smaller evaluation value, which corresponds to the output value of the loss function, is better.
To obtain inference results from first network 111 and second network 112, evaluation value calculator 103 may input the same data to first network 111 and second network 112. Alternatively, such data input may be performed by learning processor 104 or by, for example, an inputter or a network controller not illustrated in the drawings.
Learning processor 104 updates second network 112 to improve the inference accuracy and the inference speed. More specifically, learning processor 104 obtains the evaluation value from evaluation value calculator 103 and updates second network 112 to improve the evaluation value.
In particular, learning processor 104 updates second network 112 to improve the evaluation value, thereby performing update to reduce the difference between the inference result of first network 111 and the inference result of second network 112. Stated differently, learning processor 104 distills first network 111 to train second network 112.
Network searcher 101, evaluation value calculator 103, and learning processor 104 repeat the foregoing processes, thereby enabling the obtainment of second network 112 whose inference accuracy and inference speed are better. The search for second network 112 performed by information processing system 100 is a search automatically performed on the basis of the loss function and is a search for a network architecture. Such search is also referred to as automatic search or neural architecture search (NAS).
Next, learning processor 104 trains second network 112 (S102). More specifically, learning processor 104 trains second network 112 to improve an evaluation value obtained from evaluation value calculator 103. This process corresponds to distilling first network 111 to train second network 112.
Next, network searcher 101 searches for second network 112 with which an evaluation value to be obtained is expected to be better as a result of the training (S103).
When the performance of second network 112 satisfies the requirement (Yes in S104), the processing ends. For example, the processing ends when the inference accuracy and the inference speed of second network 112 satisfy the requirement. Meanwhile, when the performance of second network 112 fails to satisfy the requirement (No in S104), the training of second network 112 (S102) and the search for second network 112 (S103) are repeated.
In the second reference example, the architecture of second network 112 to be searched for in the search for second network 112 is the one whose inference accuracy is close to that of first network 111 and whose inference speed is higher than that of first network 111. Second network 112 has higher inference speed than that of first network 111, and thus is assumed to be smaller in size than first network 111.
However, the second reference example fails to take into consideration the size reduction of the integrated environment network. In particular, the second reference example fails to take into consideration quantization. For this reason, second network 112 may not be applicable to an integrated environment network. Also, the size reduction that includes quantization can result in degradation compared to the reference network.
The present embodiment describes an information processing method and so forth capable of finding an inference model that is expected to reduce the foregoing loss. Note that reducing the loss may be read as compensating for the loss.
Network searcher 201 is an information processor that performs information processing. Network searcher 201 sets first network 211 as the initial value for the search and searches for second network 212 that is expected to reduce the loss caused by size reduction.
For example, network searcher 201 obtains first network 211 as a reference network from, for example, an external device of information processing system 200. Subsequently, network searcher 201 sets first network 211 at the start position of searching for second network 212. After this, network searcher 201 obtains the evaluation value from evaluation value calculator 203, and searches for second network 212 with which the evaluation value is expected to be better.
The initial value for search to be set may be, for example, the number of nodes, the type of a kernel, etc. Also, network searcher 201 may obtain the settings information indicating the settings for size reduction and difficulty level information indicating the inference difficulty level to determine the initial value for search, on the basis of the settings for size reduction and the inference difficulty level.
Stated differently, network searcher 201 may adjust the initial value for search on the basis of the settings for size reduction and the inference difficulty level. More specifically, network searcher 201 may increase the number of nodes when the settings for size reduction are set at a higher level than the reference. Network searcher 201 may also increase the number of nodes when the inference difficulty level is set at a higher level than the reference. Also, a table may be used for determining the initial value for search on the basis of the settings for size reduction, the inference difficulty level, or a combination of these.
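A table of the kind mentioned above might look like the following Python sketch. The level labels, the scale factors, and the names `INITIAL_NODE_SCALE` and `initial_node_count` are hypothetical values chosen for illustration; the present disclosure does not specify the contents of the table.

```python
import math

# Hypothetical table: (size reduction level, inference difficulty level) -> scale.
# "reference" means the reference level; "higher" means above the reference.
INITIAL_NODE_SCALE = {
    ("reference", "reference"): 1.0,
    ("higher", "reference"): 1.5,
    ("reference", "higher"): 1.5,
    ("higher", "higher"): 2.0,
}


def initial_node_count(
    reference_nodes: int, reduction_level: str, difficulty_level: str
) -> int:
    """Look up the scale factor and apply it to the reference node count."""
    scale = INITIAL_NODE_SCALE[(reduction_level, difficulty_level)]
    return math.ceil(reference_nodes * scale)
```

With 64 nodes in the reference network, a higher size reduction level alone yields 96 initial nodes, and a higher level for both yields 128.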
First network 211 and second network 212 are inference models, such as neural network models, for performing inference processing. The network size of second network 212 is larger than the network size of first network 211. This is expected to reduce the loss caused by size reduction.
First network 211 is also a basic network and a reference network. First network 211 uses, for example, floating-point representation. Second network 212 is an intermediate network that is neither a reference network nor an integrated environment network. Second network 212 also uses, for example, floating-point representation.
Size reductor 202 is an information processor that performs information processing. Size reductor 202 reduces the size of second network 212, thereby generating third network 213. Size reduction is performed for a plurality of purposes described below.
One of the purposes is to reduce the network execution latency. The execution latency is represented in the unit of milliseconds (ms) or microseconds (μs).
Another of the purposes is to reduce a required amount of computation. The required amount of computation is represented by the number of times operations (or Ops) are performed.
Another of the purposes is to reduce a required amount of memory of a weight storage memory, an intermediate feature storage memory, and so forth. The required amount of memory is represented in bits.
Another of the purposes is to reduce a required amount of memory transfer. The required amount of memory transfer mainly represents the amount of memory transfer between a processor of a device that performs inference processing and an external DRAM. The required amount of memory transfer is represented in bits per second. Note, however, that the memory with which the processor exchanges data is not limited to an external DRAM.
Another of the purposes is to reduce electric power consumption and electric energy consumption. Electric power consumption is represented in watts (W) or milliwatts (mW), and electric energy consumption is represented in watt-hours (Wh). Electric power consumption and electric energy consumption are determined by a combination of elements such as the hardware that performs inference processing, the required amount of computation, and the required amount of memory transfer.
Another of the purposes is to reduce the size of a device into which a neural network model, a deep learning model, or a machine learning model is to be integrated. The device size is represented in cubic centimeters (cm³) or cubic millimeters (mm³). The device size is determined by a combination of elements such as the electric power consumption of the device, the heat capacity of the device, the network execution latency required for the device, and the sizes of components of the device.
Note that size reduction need not be intended for all of the foregoing purposes, and thus may be performed for only some of them.
The size reduction performed for the foregoing purposes basically includes quantization. Quantization may be, for example, quantization for changing floating-point representation to fixed-point representation. Note that quantization is not limited to quantization for changing floating-point representation to fixed-point representation, and thus may be quantization for changing the representation form to a representation form that requires a smaller number of bits.
For example, by quantizing the network, the representation forms of parameters, input data values, intermediate data values, output data values, and so forth of the network are changed to be representation forms that require a smaller number of bits. Not all but some of the representation forms of parameters, input data values, intermediate data values, output data values, and so forth of the network may be changed to be representation forms that require a smaller number of bits.
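The change from floating-point representation to fixed-point representation can be illustrated with a minimal Python sketch. The signed 8-bit format with 4 fractional bits, the round-to-nearest rule, the saturation at the range limits, and the function names are assumptions made for this example, not settings taken from the present disclosure.

```python
def to_fixed_point(value: float, total_bits: int = 8, frac_bits: int = 4) -> int:
    """Quantize a float to a signed fixed-point integer code with saturation."""
    scale = 1 << frac_bits                    # 2 ** frac_bits steps per unit
    lowest = -(1 << (total_bits - 1))         # e.g. -128 for 8 bits
    highest = (1 << (total_bits - 1)) - 1     # e.g. 127 for 8 bits
    code = round(value * scale)               # round to the nearest step
    return max(lowest, min(highest, code))    # clamp values outside the range


def from_fixed_point(code: int, frac_bits: int = 4) -> float:
    """Recover the float value that a fixed-point code represents."""
    return code / (1 << frac_bits)
```

For instance, 0.3 maps to the code 5, which represents 0.3125; the gap of 0.0125 is an example of the rounding error that quantization introduces. Using fewer bits widens the steps and increases this error.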
Size reduction may include the reduction of the network size such as the reduction in the number of layers, the reduction in the number of nodes, the reduction in the number of inter-node connections, and so forth. Also, size reduction may include only quantization. Stated differently, size reduction may be quantization. Alternatively, size reduction that includes no quantization may be used.
Also, for example, size reductor 202 obtains the settings information for size reduction, and reduces the size of second network 212 on the basis of the settings information. The settings information may include the number of quantization bits (in other words, the degree of quantization), the amount by which the number of layers is reduced, the amount by which the number of nodes is reduced, the amount by which the number of inter-node connections is reduced, and so forth. Here, a small number of quantization bits (in other words, a large degree of quantization) corresponds to a high quantization level, and a large number of quantization bits (in other words, a small degree of quantization) corresponds to a low quantization level. The settings information for size reduction may be stored in a memory, etc. as size reduction settings 231.
Third network 213 is an inference model, such as a neural network model, for performing inference processing. The network size of third network 213 is basically larger than the network size of first network 211. However, the present disclosure is not limited to such configuration, and thus the network size of third network 213 may be smaller than the network size of first network 211.
Third network 213 is an integrated environment network. More specifically, third network 213 is a network that is expected to achieve the purposes of size reduction. Third network 213 uses, for example, fixed-point representation.
Evaluation value calculator 203 is an information processor that performs information processing. Evaluation value calculator 203 calculates an evaluation value. More specifically, evaluation value calculator 203 calculates the inference accuracy of third network 213 as an evaluation value.
For example, evaluation value calculator 203 obtains the inference result of first network 211 and the inference result of third network 213 from the same input data. Evaluation value calculator 203 may then calculate the difference between the inference result of first network 211 and the inference result of third network 213 as the evaluation value. Alternatively, evaluation value calculator 203 may calculate the difference between correct answer data 232 stored in, for example, the memory and the inference result of third network 213 as the evaluation value. A smaller evaluation value is better.
To obtain inference results from first network 211 and second network 212, evaluation value calculator 203 may input the same data to first network 211 and second network 212. Alternatively, such data input may be performed by learning processor 204 or by, for example, an inputter or a network controller not illustrated in the drawings.
The evaluation value calculated by evaluation value calculator 203 is used, for example, in network searcher 201 and learning processor 204. Evaluation value calculator 203 may calculate the evaluation value used in network searcher 201 and the evaluation value used in learning processor 204 on the basis of different criteria.
More specifically, evaluation value calculator 203 may calculate, for example, the difference between the inference result of first network 211 and the inference result of third network 213 as the evaluation value used in network searcher 201. Evaluation value calculator 203 may then calculate the difference between correct answer data 232 and the inference result of third network 213 as the evaluation value used in learning processor 204.
Learning processor 204 is an information processor that performs information processing. Learning processor 204 updates third network 213 to improve the inference accuracy of third network 213. More specifically, learning processor 204 obtains the evaluation value from evaluation value calculator 203, and updates third network 213 to improve the evaluation value.
Learning processor 204 may update third network 213 to reduce the difference between the inference result of first network 211 and the inference result of third network 213, in accordance with the evaluation value that corresponds to the difference between the inference result of first network 211 and the inference result of third network 213. Stated differently, learning processor 204 may distill first network 211 to train third network 213.
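One common way to express such a difference for distillation is a Kullback-Leibler divergence between the softmax outputs of the two networks. The sketch below assumes that each network produces a list of logits; the use of KL divergence and the function names are illustrative assumptions, since the disclosure does not fix the difference measure here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: Sequence[float], student_logits: Sequence[float]
) -> float:
    """KL(teacher || student) between softmax outputs (assumed measure).

    Here the teacher stands for first network 211 and the student for
    third network 213; training would update the student to lower this value.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

The value is zero when both networks produce the same distribution and grows as the outputs diverge.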
Alternatively, learning processor 204 may perform adversarial learning on third network 213. The adversarial learning may be performed on the basis of a comparison between the inference result of first network 211 and the inference result of third network 213 or may be performed on the basis of a comparison between correct answer data 232 and the inference result of third network 213. Alternatively, learning processor 204 may perform metric learning on third network 213.
Also, learning processor 204 may change size reduction settings 231 in accordance with, for example, the evaluation value. Stated differently, learning processor 204 may change size reduction settings 231 in accordance with, for example, the inference accuracy of third network 213. For example, learning processor 204 may decrease the degree of size reduction further as the inference accuracy of third network 213 becomes lower.
Network searcher 201, size reductor 202, evaluation value calculator 203, and learning processor 204 repeat the foregoing processes, which makes it possible to obtain third network 213, as an integrated environment network, having a better inference accuracy. The inference speed of third network 213 is high because its size has been reduced. Learning processor 204 or other elements may output the finally obtained third network 213 as an integrated environment network.
Information processing system 200 may also include difficulty level calculator 205. Difficulty level calculator 205 is an information processor that performs information processing. Difficulty level calculator 205 calculates the inference difficulty level on the basis of dataset 233. The difficulty level may be stored, for example, in the memory as task difficulty level 234.
For example, difficulty level calculator 205 may calculate the inference difficulty level on the basis of the data amount, type, and so forth of dataset 233 used for inference. Difficulty level calculator 205 may calculate the difficulty level on the basis of not dataset 233 but the type of inference. For example, the difficulty level of detection processing may be higher than the difficulty level of identification processing.
Information processing system 200 may include learning processor 206. Learning processor 206 updates second network 212 to improve the inference accuracy of second network 212. More specifically, learning processor 206 may train second network 212, using correct answer data 232 as teacher data. Alternatively, learning processor 206 may distill first network 211 to train second network 212.
Alternatively, learning processor 206 may perform adversarial learning on second network 212. The adversarial learning may be performed on the basis of a comparison between the inference result of first network 211 and the inference result of second network 212 or may be performed on the basis of a comparison between correct answer data 232 and the inference result of second network 212. Alternatively, learning processor 206 may perform metric learning on second network 212.
First, network searcher 201 obtains first network 211 and sets first network 211 to the initial value for searching for second network 212 (S201). Next, network searcher 201 obtains the settings information indicating the settings for size reduction and the difficulty level information indicating the inference difficulty level (S202). Network searcher 201 then determines the initial value for searching for second network 212 on the basis of the settings for size reduction and the inference difficulty level (S203).
For example, network searcher 201 defines first network 211 as second network 212. Subsequently, network searcher 201 adjusts second network 212 on the basis of the settings for size reduction and the inference difficulty level to determine second network 212. Through this, second network 212 for the initial stage of search is determined.
Subsequently, learning processor 206 trains second network 212 (S204). Note that this process may be omitted.
Next, size reductor 202 reduces the size of second network 212 on the basis of the settings for size reduction to generate third network 213 (S205).
Next, learning processor 204 trains third network 213 (S206). More specifically, learning processor 204 trains third network 213 to improve an evaluation value obtained from evaluation value calculator 203. This process may correspond to distilling first network 211 to train third network 213 or may correspond to other trainings.
When the performance of third network 213 satisfies the requirement (Yes in S207), the processing ends. For example, the processing ends when the inference accuracy and the inference speed of third network 213 satisfy the requirement. Meanwhile, when the performance of third network 213 fails to satisfy the requirement (No in S207), network searcher 201 changes the number of nodes of each layer or a specified layer (S208). Then, the processes are repeated from the training of second network 212 (S204).
The foregoing inference accuracy may include accuracy and correctness. Further, a correct answer agreement rate, which is an agreement rate of correct answer data 232 and the inference result of third network 213, may be used as the correctness. Alternatively, a reference agreement rate, which is an agreement rate of the inference result of first network 211 and the inference result of third network 213, may be used as the correctness. Also, the foregoing inference speed may correspond to the processing time taken for inference.
For example, the foregoing requirement may be that the inference accuracy of third network 213 is higher than or equal to the reference or that the inference speed of third network 213 is higher than or equal to the reference. More specifically, the foregoing requirement may be that the correct answer agreement rate is 90% or higher. Alternatively, the foregoing requirement may also be that the reference agreement rate is 98% or higher. The foregoing requirement may also be that the processing time is 20 ms or less. Alternatively, the foregoing requirement may be a combination of these.
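The agreement rates and the combined requirement can be sketched as follows. The sketch assumes that inference results are sequences of class labels and that the processing time has already been measured; the function names are hypothetical, and the default thresholds are the example values given above.

```python
from typing import Sequence


def agreement_rate(predicted: Sequence[int], expected: Sequence[int]) -> float:
    """Fraction of positions at which two label sequences agree."""
    if len(predicted) != len(expected) or len(expected) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(expected)


def satisfies_requirement(
    third_labels: Sequence[int],
    correct_labels: Sequence[int],
    first_labels: Sequence[int],
    processing_time_ms: float,
    min_correct_rate: float = 0.90,
    min_reference_rate: float = 0.98,
    max_time_ms: float = 20.0,
) -> bool:
    """Combined requirement: both agreement rates and the processing time."""
    correct_rate = agreement_rate(third_labels, correct_labels)
    reference_rate = agreement_rate(third_labels, first_labels)
    return (
        correct_rate >= min_correct_rate
        and reference_rate >= min_reference_rate
        and processing_time_ms <= max_time_ms
    )
```

A requirement that uses only one of these criteria would simply drop the other comparisons.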
Network searcher 201 may determine whether the performance of third network 213 satisfies the requirement, in accordance with the evaluation value obtained from evaluation value calculator 203. Also, network searcher 201 may change the number of nodes of each layer or a specified layer to improve the evaluation value obtained from evaluation value calculator 203.
The evaluation value obtained from evaluation value calculator 203 may indicate, for example, the inference accuracy of third network 213. The inference accuracy of third network 213 may indicate the difference between the inference result of third network 213 and the inference result of first network 211, or may indicate the difference between the inference result of third network 213 and correct answer data 232. When the inference accuracy of third network 213 is poor, network searcher 201 may then increase the number of nodes of each layer or a specified layer.
Thereafter, the training of second network 212 (S204), the size reduction of second network 212 (S205), the training of third network 213 (S206), and the change in the number of nodes (S208) are repeated until the performance of third network 213 satisfies the requirement. Through this, third network 213 whose performance satisfies the requirement is obtained. Learning processor 204 or other elements may output the finally obtained third network 213.
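The repetition of S204 to S208 can be sketched as a simple loop. In this sketch a network is represented only by its list of per-layer node counts, the training, size reduction, and requirement check are passed in as callables, and an iteration cap is added; all of these are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Net = List[int]  # hypothetical stand-in: number of nodes in each layer


def search_first_example(
    first_net: Net,
    train_second: Callable[[Net], Net],
    reduce_size: Callable[[Net], Net],
    train_third: Callable[[Net], Net],
    meets_requirement: Callable[[Net], bool],
    change_nodes: Callable[[Net], Net],
    max_iterations: int = 100,  # assumed safety cap, not in the source
) -> Tuple[Net, Net]:
    # S201-S203: start from the first network (adjustment by the size
    # reduction settings and the difficulty level is omitted here).
    second = list(first_net)
    for _ in range(max_iterations):
        second = train_second(second)      # S204 (optional)
        third = reduce_size(second)        # S205
        third = train_third(third)         # S206
        if meets_requirement(third):       # Yes in S207
            return second, third
        second = change_nodes(second)      # S208
    raise RuntimeError("no network satisfied the requirement")
```

With toy callables that halve each layer during size reduction and add two nodes per layer in S208, the loop widens the second network until the reduced network passes the check.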
Next, learning processor 206 trains second network 212 as an optional operation (S304). Subsequently, size reductor 202 reduces the size of second network 212 on the basis of the settings for size reduction to generate third network 213 (S305). Learning processor 204 then trains third network 213 (S306). The processes up until here are the same as those of the first operation example in the present embodiment.
Next, when the performance of third network 213 fails to satisfy a first requirement (No in S307), network searcher 201 changes the number of layers (S308). Then, the processes are repeated from the training of second network 212 (S304).
Network searcher 201 may determine whether the performance of third network 213 satisfies the first requirement, in accordance with the evaluation value obtained from evaluation value calculator 203. Network searcher 201 may also change the number of layers to improve the evaluation value obtained from evaluation value calculator 203.
The evaluation value obtained from evaluation value calculator 203 may indicate, for example, the inference accuracy of third network 213. The inference accuracy of third network 213 may indicate the difference between the inference result of third network 213 and the inference result of first network 211, or may indicate the difference between the inference result of third network 213 and correct answer data 232. When the inference accuracy of third network 213 is poor, network searcher 201 may increase the number of layers.
Meanwhile, when the performance of third network 213 satisfies the first requirement and fails to satisfy a second requirement (Yes in S307 and No in S309), network searcher 201 changes the number of nodes of each layer or a specified layer (S310). This process is the same as that of the first operation example in the present embodiment. Subsequently, the processes are repeated from the training of second network 212 (S304).
Thereafter, the training of second network 212 (S304), the size reduction of second network 212 (S305), the training of third network 213 (S306), the change in the number of layers (S308), and the change in the number of nodes (S310) are repeated until the first requirement and the second requirement are satisfied. When the performance of third network 213 satisfies the first requirement and the second requirement (Yes in S307 and Yes in S309), the processing ends. Through this, third network 213 whose performance satisfies the first requirement and the second requirement is obtained.
The second requirement is, for example, a higher requirement than the first requirement. Stated differently, the second requirement is more stringent than the first requirement. It is assumed that the performance is more affected by the change in the number of layers than the change in the number of nodes. In view of this, the number of layers is changed to satisfy the first requirement that is looser, after which the number of nodes is changed to satisfy the second requirement that is more stringent. With this, third network 213 whose performance satisfies the stringent second requirement is assumed to be found at an early stage.
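The two-stage control flow, in which the number of layers is changed first and the number of nodes second, can be sketched as follows. As before, a network is represented only by its per-layer node counts and all callables, along with the iteration cap, are hypothetical; the optional training of second network 212 (S304) is omitted.

```python
from typing import Callable, List, Tuple

Net = List[int]  # hypothetical stand-in: number of nodes in each layer


def search_second_example(
    first_net: Net,
    reduce_size: Callable[[Net], Net],
    train_third: Callable[[Net], Net],
    meets_first: Callable[[Net], bool],
    meets_second: Callable[[Net], bool],
    change_layers: Callable[[Net], Net],
    change_nodes: Callable[[Net], Net],
    max_iterations: int = 100,  # assumed safety cap, not in the source
) -> Tuple[Net, Net]:
    second = list(first_net)
    for _ in range(max_iterations):
        third = train_third(reduce_size(second))  # S305-S306
        if not meets_first(third):                # No in S307
            second = change_layers(second)        # S308
        elif not meets_second(third):             # No in S309
            second = change_nodes(second)         # S310
        else:                                     # Yes in S307 and S309
            return second, third
    raise RuntimeError("no network satisfied both requirements")
```

The coarse adjustment (layers) runs until the looser first requirement holds, and only then does the finer adjustment (nodes) take over.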
As described above, information processing system 200 in the present embodiment searches for second network 212 that is larger than first network 211. Information processing system 200 then reduces the size of second network 212 to generate third network 213 and trains third network 213.
With this, it is possible to find third network 213 that is expected to reduce a loss caused by quantization included in the size reduction.
Note that information processing system 200 may include some of the elements described in the present embodiment and may perform some of the processes described in the present embodiment. Also, at least some of the elements and processes described in the present embodiment may be combined with at least some of the elements and processes described in another embodiment.
A configuration example of the present embodiment is the same as the configuration example shown in
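$$L(x) = \mathrm{CE}(x, \mathrm{TargetNetQuant}) \cdot \alpha \cdot \log(\mathrm{LAT}(\mathrm{TargetNetQuant}))^{\beta} \cdot \gamma \cdot \mathrm{Diff}(\mathrm{RefNet}(x), \mathrm{TargetNetQuant}(x))^{\delta} + \lambda \cdot R_{\mathrm{size}}(\mathrm{SizeDiff}(\mathrm{RefNet}, \mathrm{TargetNet}))$$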
Here, x represents an input to a network. RefNet represents a reference network (i.e., first network 211). TargetNet represents an intermediate network (i.e., second network 212). TargetNetQuant represents an integrated environment network (i.e., third network 213).
CE(x, TargetNetQuant) is a cross entropy term which represents the difference between the inference result of the integrated environment network and correct answer data 232. CE is a cross entropy function. This term has the same properties as those of the normal training.
α·log(LAT(TargetNetQuant))^β is a latency term and a term relating to the execution speed of the integrated environment network. LAT is a function representing the amount of latency. LAT(TargetNetQuant) represents the amount of latency of the integrated environment network. Also, α and β are coefficients for adjusting an output value of the latency term.
γ·Diff(RefNet(x), TargetNetQuant(x))^δ is a reference integration equivalent term and is a term representing the difference between the reference network and the integrated environment network. Diff is a function representing the difference. Diff(RefNet(x), TargetNetQuant(x)) represents the difference between the inference result of the reference network and the inference result of the integrated environment network.
An absolute value difference, a square absolute value difference, a Euclidean distance, or a cosine distance, for example, may be used as the foregoing difference. γ and δ are coefficients for adjusting an output value of the reference integration equivalent term.
λ·R_size(SizeDiff(RefNet, TargetNet)) is a constraint term of the size difference. SizeDiff is a function representing the size difference. SizeDiff(RefNet, TargetNet) represents the size difference between the reference network and the intermediate network.
The difference in the number of parameters, the difference in the number of channels, or the difference in the number of layers, for example, may be used as the size difference. Alternatively, the difference in the number of parameters that are assigned weights on a layer-by-layer basis, where a smaller weight is assigned to an important layer, for example, may be used as the size difference.
R_size is a size difference evaluation function for evaluating the size difference. R_size(SizeDiff(RefNet, TargetNet)) decreases with an increase in the size difference between the reference network and the intermediate network and increases with a decrease in the size difference between the reference network and the intermediate network.
The size of the intermediate network is set to be larger than the reference network. As such, R_size(SizeDiff(RefNet, TargetNet)) is smaller as the intermediate network is relatively large with respect to the reference network. Also, λ is a coefficient for adjusting an output value of the constraint term of the size difference.
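The loss function described above can be sketched numerically as follows, reading β and δ as exponents of the latency term and the reference integration equivalent term. The inputs are assumed to be precomputed scalars, and `example_r_size` is a hypothetical decreasing function chosen only so that the sketch runs; the concrete form of R_size is not reproduced here.

```python
import math
from typing import Callable


def loss(
    ce: float,          # CE(x, TargetNetQuant)
    latency: float,     # LAT(TargetNetQuant); assumed > 1 so the log is positive
    diff: float,        # Diff(RefNet(x), TargetNetQuant(x))
    size_diff: float,   # SizeDiff(RefNet, TargetNet)
    alpha: float,
    beta: float,
    gamma: float,
    delta: float,
    lam: float,
    r_size: Callable[[float], float],
) -> float:
    """L = CE * alpha*log(LAT)^beta * gamma*Diff^delta + lambda*R_size(SizeDiff)."""
    latency_term = alpha * math.log(latency) ** beta
    reference_term = gamma * diff ** delta
    size_term = lam * r_size(size_diff)
    return ce * latency_term * reference_term + size_term


def example_r_size(size_diff: float) -> float:
    """Hypothetical R_size: decreases as the size difference increases."""
    return 1.0 / (1.0 + size_diff)
```

With all other inputs fixed, a larger size difference yields a smaller output value, which matches the stated behavior of the constraint term.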
Evaluation value calculator 203 performs an operation of the foregoing loss function. Evaluation value calculator 203 then outputs an output value of the loss function as the evaluation value. Network searcher 201 obtains the output value of the loss function from evaluation value calculator 203 as the evaluation value, and searches for second network 212 with which an output value of the loss function is smaller.
Learning processor 204 may also obtain an output value of the loss function from evaluation value calculator 203 as the evaluation value, and train third network 213 to decrease an output value of the loss function. Alternatively, evaluation value calculator 203 may calculate a different evaluation value used in learning processor 204 from the evaluation value used in network searcher 201. Learning processor 204 may then train third network 213 on the basis of such different evaluation value from the output value of the loss function.
Also, evaluation value calculator 203 may set γ, δ, λ, ε, and θ on the basis of the settings for size reduction and the inference difficulty level. For example, evaluation value calculator 203 may adjust these coefficients on the basis of whether the settings for size reduction are positive or negative.
In relation to pruning for reducing the number of nodes, positive size reduction means that the rate of reducing the number of nodes is high, whereas negative size reduction means that the rate of reducing the number of nodes is low. In relation to quantization, positive size reduction means that the number of quantization bits is small and the quantization level is high, whereas negative size reduction means that the number of quantization bits is large and the quantization level is low. Positive size reduction may also mean that the rate of reducing the number of layers is high, whereas negative size reduction may mean that the rate of reducing the number of layers is low.
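The effect of the number of quantization bits can be illustrated with a small fixed-point rounding sketch. The signed fixed-point format, the function name, and the parameter names are assumptions made for illustration; the disclosure does not specify a particular quantization scheme.

```python
from typing import List, Sequence


def quantize_fixed_point(
    values: Sequence[float], num_bits: int, frac_bits: int
) -> List[float]:
    """Round values to signed fixed point (hypothetical format).

    num_bits is the total bit width and frac_bits the number of fractional
    bits. Values outside the representable range are clamped.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (num_bits - 1))
    q_max = (1 << (num_bits - 1)) - 1
    quantized = []
    for v in values:
        # Python's round() rounds exact halves to the nearest even integer.
        q = max(q_min, min(q_max, round(v * scale)))
        quantized.append(q / scale)
    return quantized


def max_abs_error(values: Sequence[float], quantized: Sequence[float]) -> float:
    """Largest absolute deviation introduced by quantization."""
    return max(abs(v - q) for v, q in zip(values, quantized))
```

In this sketch, positive size reduction corresponds to a smaller `num_bits`, which gives a coarser grid and a larger rounding error, whereas negative size reduction corresponds to a larger `num_bits`.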
It is assumed, for example, that positive size reduction results in a decrease in the inference accuracy. In view of this, when size reduction is positive, evaluation value calculator 203 changes γ, δ, λ, ε, and θ of the loss function to increase an output value of the loss function. Then, second network 212 is searched for to cause an output value of the loss function to be less than or equal to the threshold. This can alleviate the decrease in the inference accuracy.
Alternatively, when size reduction is positive, γ and δ are set to be larger to increase a weight of the difference between the inference result of the reference network and the inference result of the integrated environment network. In this case, λ and θ are set to be larger and ε is set to be smaller to increase a weight of the size difference.
With this, an output value of the loss function decreases with a decrease in the difference between the inference result of the reference network and the inference result of the integrated environment network and with an increase in the size of the intermediate network.
Subsequently, network searcher 201 searches for an intermediate network with which an output value of the loss function is smaller. Stated differently, network searcher 201 searches for an intermediate network to decrease the difference between the inference result of the reference network and the inference result of the integrated environment network and increase the size of the intermediate network. This alleviates the decrease in the inference accuracy. Also, even when the size of the intermediate network is large, positive size reduction alleviates the decrease in the inference speed.
It is assumed that negative size reduction does not result in a significant decrease in the inference accuracy. For this reason, the foregoing settings are reversed. When size reduction is negative, for example, evaluation value calculator 203 changes γ, δ, λ, ε, and θ of the loss function to decrease an output value of the loss function. Alternatively, in the same case, γ and δ are set to be smaller to decrease a weight of the difference between the inference result of the reference network and the inference result of the integrated environment network. Also in this case, λ and θ are set to be smaller and ε is set to be larger to decrease a weight of the size difference.
It is assumed that a high inference difficulty level results in a decrease in the inference accuracy. In view of this, when the inference difficulty level is high, evaluation value calculator 203 changes γ, δ, λ, ε, and θ of the loss function to increase an output value of the loss function. Then, second network 212 is searched for to cause an output value of the loss function to be less than or equal to the threshold. This can alleviate the decrease in the inference accuracy.
Alternatively, when the inference difficulty level is high, γ and δ are set to be larger to increase a weight of the difference between the inference result of the reference network and the inference result of the integrated environment network. Also in this case, λ and θ are set to be larger and ε is set to be smaller to increase a weight of the size difference. This alleviates the decrease in the inference accuracy.
It is assumed that a low inference difficulty level does not result in a significant decrease in the inference accuracy. For this reason, the foregoing settings are reversed. When the inference difficulty level is low, for example, evaluation value calculator 203 changes γ, δ, λ, ε, and θ of the loss function to decrease an output value of the loss function. Alternatively, in the same case, γ and δ are set to be smaller to decrease a weight of the difference between the inference result of the reference network and the inference result of the integrated environment network. Also in this case, λ and θ are set to be smaller and ε is set to be larger to decrease a weight of the size difference.
Note that on the basis of the settings for size reduction and the inference difficulty level, evaluation value calculator 203 may adjust at least one coefficient among γ, δ, λ, ε, and θ, with the other coefficients remaining as the initial values. For example, evaluation value calculator 203 may adjust ε among ε and θ included in R_size function, and maintain θ. Alternatively, evaluation value calculator 203 may adjust only λ, θ, and ε, for example, that relate to the size difference, among γ, δ, λ, ε, and θ.
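The direction of these adjustments can be sketched as follows. The sketch follows only the stated directions (γ, δ, λ, and θ larger and ε smaller for positive size reduction or a high difficulty level, and the reverse otherwise); the multiplicative scale factor, the dictionary keys, and the function name are assumptions made for illustration.

```python
from typing import Dict


def adjust_coefficients(
    coeffs: Dict[str, float], strengthen: bool, scale: float = 2.0
) -> Dict[str, float]:
    """Return adjusted loss-function coefficients (hypothetical rule).

    strengthen=True models positive size reduction or a high inference
    difficulty level; strengthen=False models the reverse case.
    Expected keys: gamma, delta, lam, theta, epsilon.
    """
    factor = scale if strengthen else 1.0 / scale
    adjusted = dict(coeffs)
    for name in ("gamma", "delta", "lam", "theta"):
        adjusted[name] = coeffs[name] * factor   # larger when strengthening
    adjusted["epsilon"] = coeffs["epsilon"] / factor  # moves the other way
    return adjusted
```

Adjusting only a subset of the coefficients, as noted above, would amount to restricting the names in the loop.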
First, network searcher 201 obtains first network 211 and sets first network 211 to the initial value for searching for second network 212 (S401). Next, network searcher 201 obtains the settings information indicating the settings for size reduction and the difficulty level information indicating the inference difficulty level (S402). Network searcher 201 then determines the initial value for searching for second network 212 on the basis of the settings for size reduction and the inference difficulty level (S403).
The processes up until here are the same as those of the first operation example and the second operation example in Embodiment 1.
Next, evaluation value calculator 203 sets coefficients of the loss function (S404). More specifically, evaluation value calculator 203 may obtain the settings information indicating the settings for size reduction and the difficulty level information indicating the inference difficulty level, and set the coefficients of the loss function on the basis of the settings for size reduction and the inference difficulty level.
Next, learning processor 206 trains second network 212 as an optional operation (S405). Subsequently, size reductor 202 reduces the size of second network 212 on the basis of the settings for size reduction to generate third network 213 (S406). Learning processor 204 then trains third network 213 (S407). These processes are the same as those of the first operation example and the second operation example in Embodiment 1.
Next, network searcher 201 searches for second network 212 with which an evaluation value to be obtained is expected to be better as a result of the training (S408). More specifically, network searcher 201 searches for a new second network 212 with which an output value of the foregoing loss function is expected to be smaller. Stated differently, network searcher 201 adjusts, for example, the number of layers and the number of nodes of second network 212 to decrease an output value of the foregoing loss function.
Note that the processes may be repeated from the training of second network 212 (S405) to the search for second network 212 (S408) before the determination of whether the performance of third network 213 satisfies the requirement (S409).
When the performance of third network 213 satisfies the requirement (Yes in S409), the processing ends. For example, the processing ends when the inference accuracy and the inference speed of third network 213 satisfy the requirement. Meanwhile, when the performance of third network 213 fails to satisfy the requirement (No in S409), the processes are repeated from the training of second network 212 (S405) to the search for second network 212 (S408). Through this, third network 213 whose performance satisfies the requirement is obtained.
The requirement shown in Embodiment 1 may be used as the foregoing requirement. The foregoing requirement may be, for example, that the correct answer agreement rate is 90% or higher, the reference agreement rate is 98% or higher, and the processing time is 20 ms or less. Alternatively, the output value of the loss function may be used as an index of the performance of third network 213, and the foregoing requirement may be that the output value of the loss function is less than or equal to the threshold.
First, network searcher 201 obtains first network 211 and sets first network 211 to the initial value for searching for second network 212 (S501). Next, network searcher 201 obtains the settings information indicating the settings for size reduction and the difficulty level information indicating the inference difficulty level (S502). Network searcher 201 then determines the initial value for searching for second network 212 on the basis of the settings for size reduction and the inference difficulty level (S503).
Next, evaluation value calculator 203 sets coefficients of the loss function (S504). Next, learning processor 206 trains second network 212 as an optional operation (S505). Subsequently, size reductor 202 reduces the size of second network 212 on the basis of the settings for size reduction to generate third network 213 (S506). Learning processor 204 then trains third network 213 (S507). Next, network searcher 201 searches for second network 212 (S508).
The processes up until here are the same as those of the first operation example in the present embodiment. Note that, as with the first operation example, the processes may be repeated from the training of second network 212 (S505) to the search for second network 212 (S508) before the determination of whether the performance of third network 213 satisfies the first requirement (S509).
Next, when the performance of third network 213 fails to satisfy the first requirement (No in S509), learning processor 204 makes a negative change to size reduction settings 231 (S510). Stated differently, learning processor 204 changes the size reduction to be applied to second network 212 to negative size reduction. More specifically, learning processor 204 may increase the number of quantization bits or may decrease the rate of reducing the number of layers and the number of nodes.
Then, the processes are repeated from the setting of the coefficients of the loss function (S504). Note that in the setting of the coefficients of the loss function (S504), the coefficients of the loss function are set on the basis of size reduction settings 231 that have been changed.
Also, when the performance of third network 213 satisfies the first requirement and fails to satisfy the second requirement (Yes in S509 and No in S511), the processes are repeated from the setting of the coefficients of the loss function (S504). In this case, size reduction settings 231 are not to be changed. The processes may thus be repeated from the training of second network 212 (S505).
When the performance of third network 213 satisfies the first requirement and the second requirement, and fails to satisfy a third requirement (Yes in S509, Yes in S511, and No in S512), learning processor 204 makes a positive change to size reduction settings 231 (S513). Stated differently, learning processor 204 changes the size reduction to be applied to second network 212 to positive size reduction. More specifically, learning processor 204 may decrease the number of quantization bits or increase the rate of reducing the number of layers and the number of nodes.
Then, the processes are repeated from the setting of coefficients of the loss function (S504). Note that in the setting of the coefficients of the loss function (S504), the coefficients of the loss function are set on the basis of size reduction settings 231 that have been changed.
Then, the processes are repeated from the setting of the coefficients of the loss function (S504) to the search for second network 212 (S508) until the performance of third network 213 satisfies the first requirement, the second requirement, and the third requirement. When the performance of third network 213 satisfies the first requirement, the second requirement, and the third requirement (Yes in S509, Yes in S511, and Yes in S512), the processing ends. Through this, third network 213 whose performance satisfies the first requirement, the second requirement, and the third requirement is obtained.
The second requirement is, for example, a higher requirement than the first requirement. Stated differently, the second requirement is more stringent than the first requirement. In view of this, size reduction settings 231 are changed to satisfy the first requirement that is looser, after which second network 212 is searched for to satisfy the second requirement that is more stringent. With this, third network 213 whose performance satisfies the stringent second requirement is assumed to be found at an early stage.
Also, the first requirement and the second requirement may be requirements for the inference accuracy of third network 213, while the third requirement may be a requirement for the inference speed of third network 213.
In this case, second network 212 is first searched for regarding the inference accuracy of third network 213. On the basis of the result of the search performed regarding the inference accuracy, a search is then performed for second network 212 regarding the inference speed of third network 213, together with a positive change to size reduction settings 231. With this, third network 213 that satisfies a plurality of requirements for the inference accuracy and the inference speed can be efficiently found.
More specifically, the first requirement may be, for example, that the correct answer agreement rate is 70% or higher and the reference agreement rate is 90% or higher. The second requirement may be that the correct answer agreement rate is 80% or higher and the reference agreement rate is 95% or higher. The third requirement may be that the processing time is 20 ms or less.
Also, in the foregoing operation, the processes are repeated from the setting of the coefficients of the loss function (S504) to the search for second network 212 (S508). However, the processes may be repeated from the determination of the initial value for search (S503) to the search for second network 212 (S508). Alternatively, the processes may be repeated from the training of second network 212 (S505) to the search for second network 212 (S508).
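The branching on the three requirements (S509, S511, and S512) and the resulting change in the number of quantization bits can be sketched as follows. The action labels, the step size, and the bit-width limits are hypothetical values chosen for illustration.

```python
def next_action(meets_first: bool, meets_second: bool, meets_third: bool) -> str:
    """Decide what to do after evaluating third network 213."""
    if not meets_first:        # No in S509
        return "negative_change"   # S510: relax the size reduction
    if not meets_second:       # Yes in S509, No in S511
        return "search_again"      # keep size reduction settings 231 as-is
    if not meets_third:        # Yes in S509 and S511, No in S512
        return "positive_change"   # S513: strengthen the size reduction
    return "done"


def change_quantization_bits(
    bits: int, action: str, step: int = 1, min_bits: int = 2, max_bits: int = 32
) -> int:
    """Apply the action to the bit width (limits are assumed values)."""
    if action == "negative_change":
        return min(max_bits, bits + step)   # more bits, weaker reduction
    if action == "positive_change":
        return max(min_bits, bits - step)   # fewer bits, stronger reduction
    return bits
```

For example, a network that meets both accuracy requirements but misses the 20 ms processing time would trigger a positive change, lowering the bit width by one step before the next iteration.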
As described above, information processing system 200 in the present embodiment uses the loss function to search for second network 212. Such loss function is a loss function for finding second network 212 and third network 213 that are expected to reduce a loss caused by quantization included in size reduction. With this, it is possible to efficiently find third network 213 as an integrated environment network.
Note that information processing system 200 may include some of the elements described in the present embodiment and may perform some of the processes described in the present embodiment. Also, at least some of the elements and processes described in the present embodiment may be combined with at least some of the elements and processes described in another embodiment.
In Embodiment 1 and Embodiment 2, first network 211 is provided, for example, from the third party, and thus the parameters included in first network 211 are fixed without being changed.
The present embodiment is not limited to the above, and allows for a change in the parameters included in first network 211. The present embodiment also performs an operation for causing the inference result of first network 211 and the inference result of third network 213 to be close to each other. With this, it is expected that a similar inference result is obtained in a device regardless of whether such device uses floating-point representation or fixed-point representation.
More specifically, information processing system 200 in the present embodiment changes the parameters of first network 211 without fixing the parameters of first network 211, thereby generating third network 213 capable of obtaining an inference result that highly agrees with the inference result of first network 211.
Even more specifically, information processing system 200 distills second network 212 to train first network 211. Information processing system 200 also reduces the size of second network 212, thereby generating third network 213.
Subsequently, information processing system 200 trains first network 211 and third network 213 that have the same parent, that is, second network 212, so that the inference results thereof are close to each other. Any one of distillation learning, adversarial learning, or metric learning may be used for the training of first network 211 and third network 213.
In the present embodiment, first network 211 and third network 213 have the same parent. As such, it is expected that a reference agreement rate improves compared to Embodiment 1 and Embodiment 2.
The operation performed by information processing system 200 in the present embodiment is divided into three phases: a first phase, a second phase, and a third phase.
In the first phase, second network 212 having a large size is trained. In the second phase, second network 212 is distilled to train first network 211. In the third phase, the size of second network 212 is reduced to generate third network 213, and first network 211 is distilled to train third network 213. In the third phase, second network 212 may be distilled to train third network 213.
More specifically, in the first phase, learning processor 206 trains second network 212. Note that the first phase may be omitted.
In the second phase, evaluation value calculator 203 obtains the inference result of first network 211 and the inference result of second network 212 to calculate an evaluation value indicating the difference between the inference result of first network 211 and the inference result of second network 212.
Subsequently, learning processor 204 trains first network 211 on the basis of the evaluation value indicating the difference between the inference result of first network 211 and the inference result of second network 212. More specifically, learning processor 204 trains first network 211 to reduce the difference between the inference result of first network 211 and the inference result of second network 212.
In the third phase, size reductor 202 reduces the size of second network 212 to generate third network 213. Subsequently, evaluation value calculator 203 obtains the inference result of first network 211 and the inference result of third network 213 to calculate an evaluation value indicating the difference between the inference result of first network 211 and the inference result of third network 213.
Subsequently, learning processor 204 trains third network 213 on the basis of the evaluation value indicating the difference between the inference result of first network 211 and the inference result of third network 213. More specifically, learning processor 204 trains third network 213 to reduce the difference between the inference result of first network 211 and the inference result of third network 213.
In the third phase, evaluation value calculator 203 may obtain the inference result of second network 212 and the inference result of third network 213 to calculate an evaluation value indicating the difference between the inference result of second network 212 and the inference result of third network 213.
Subsequently, learning processor 204 may train third network 213 on the basis of the evaluation value indicating the difference between the inference result of second network 212 and the inference result of third network 213. More specifically, learning processor 204 may train third network 213 to reduce the difference between the inference result of second network 212 and the inference result of third network 213.
Second network 212 has a high representation capability. Information of second network 212 having a high representation capability is reflected in first network 211 and third network 213. As such, first network 211 and third network 213 are expected to achieve a similar inference accuracy.
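As a non-limiting illustration, the second phase and the third phase described above may be sketched with toy models. Everything in this sketch is an assumption made for illustration: each network is reduced to a pair `(w, b)` computing `w * x + b`, the evaluation value is taken to be a mean squared difference, size reduction is modeled as rounding parameters to a 0.25 grid, and the third network is trained in floating point for simplicity.

```python
def evaluation_value(out_a, out_b):
    """Mean squared difference between two lists of inference results."""
    return sum((a - b) ** 2 for a, b in zip(out_a, out_b)) / len(out_a)


def infer(model, inputs):
    """Toy inference: a model is a pair (w, b) and computes w * x + b."""
    w, b = model
    return [w * x + b for x in inputs]


def distill(student, teacher_out, inputs, lr=0.05, steps=500):
    """Gradient descent that reduces the evaluation value against the teacher."""
    w, b = student
    n = len(inputs)
    for _ in range(steps):
        out = infer((w, b), inputs)
        grad_w = sum(2 * (o - t) * x for o, t, x in zip(out, teacher_out, inputs)) / n
        grad_b = sum(2 * (o - t) for o, t in zip(out, teacher_out)) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)


inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]
second = (1.9, 0.43)  # stands in for a trained second network 212
second_out = infer(second, inputs)

# Second phase: distill second network 212 to train first network 211.
first = distill((0.0, 0.0), second_out, inputs)
first_out = infer(first, inputs)

# Third phase: reduce the size of second network 212 (toy: round to a 0.25 grid),
# then distill first network 211 to train third network 213.
third_initial = tuple(round(p / 0.25) * 0.25 for p in second)
before = evaluation_value(infer(third_initial, inputs), first_out)
third = distill(third_initial, first_out, inputs)
after = evaluation_value(infer(third, inputs), first_out)
```

In this sketch, `after` is smaller than `before`, which corresponds to training third network 213 to reduce the difference from the inference result of first network 211. Keeping the parameters of the third network on a fixed-point grid during training is outside the scope of the sketch.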
Note that, instead of learning processors 204 and 206, information processing system 200 may include three learning processors that correspond to the respective three networks. Stated differently, information processing system 200 may include a first network learning processor that trains first network 211, a second network learning processor that trains second network 212, and a third network learning processor that trains third network 213. Information processing system 200 may also include three evaluation value calculators that correspond to these.
First, network searcher 201 obtains first network 211 and sets first network 211 to the initial value for searching for second network 212 (S601).
Note that network searcher 201 may generate first network 211, thereby obtaining first network 211. More specifically, network searcher 201 may determine an expected size of first network 211 on the basis of the design requirements, for example, to determine the architecture of first network 211. Network searcher 201 may then generate first network 211 on the basis of the determined architecture.
Next, network searcher 201 obtains the settings information indicating the settings for size reduction and the difficulty level information indicating the inference difficulty level (S602). Network searcher 201 then determines the initial value for searching for second network 212 on the basis of the settings for size reduction and the inference difficulty level (S603). Subsequently, learning processor 206 trains second network 212 (S604). These processes are the same as those of the first operation example in the present embodiment.
Next, when the performance of second network 212 fails to satisfy the first requirement (No in S605), network searcher 201 changes the number of layers (S606). Subsequently, the processes are repeated from the training of second network 212 (S604).
When the performance of second network 212 satisfies the first requirement and fails to satisfy the second requirement (Yes in S605 and No in S607), network searcher 201 changes the number of nodes of each layer or a specified layer (S608). Subsequently, the processes are repeated from the training of second network 212 (S604).
Then, the training of second network 212 (S604), the change in the number of layers (S606), and the change in the number of nodes (S608) are repeated until the performance of second network 212 satisfies the first requirement and the second requirement. Through this, second network 212 whose performance satisfies the first requirement and the second requirement is obtained.
These processes (S605, S606, S607, and S608) correspond to the processes (S307, S308, S309, and S310) of the second operation example in Embodiment 1. Note, however, that the determination about the performance of third network 213 is performed in the processes of the second operation example in Embodiment 1, but a determination about the performance of second network 212 is performed in the processes of the first phase of the present operation example. Then, when the performance of second network 212 satisfies the first requirement and the second requirement (Yes in S605 and Yes in S607), the processing of the first phase ends.
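As a non-limiting illustration, the repetition in the first phase (S604 to S608) may be sketched as the loop below. The `evaluate` callable, the step sizes, and the iteration limit are hypothetical; in particular, the description above only states that the number of layers and the number of nodes are changed, and the choice to increase them here is an assumption.

```python
def search_second_network(evaluate, layers, nodes, max_iters=100):
    """Repeat training and architecture changes until both requirements hold.

    `evaluate(layers, nodes)` is a hypothetical callable that trains a candidate
    second network and returns (meets_first, meets_second).
    """
    for _ in range(max_iters):
        meets_first, meets_second = evaluate(layers, nodes)  # S604, S605, S607
        if not meets_first:
            layers += 1   # S606: change the number of layers (assumed: increase)
        elif not meets_second:
            nodes += 8    # S608: change the number of nodes (assumed: increase)
        else:
            return layers, nodes  # both requirements satisfied; first phase ends
    raise RuntimeError("no candidate satisfied both requirements")


def toy_evaluate(layers, nodes):
    # Toy stand-in for training and measuring second network 212.
    return layers >= 4, layers * nodes >= 256


result = search_second_network(toy_evaluate, layers=2, nodes=16)
```

With the toy evaluation above, the loop first changes the number of layers until the looser requirement holds and only then changes the number of nodes.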
In the second phase, learning processor 204 distills second network 212 to train first network 211 (S609).
More specifically, evaluation value calculator 203 obtains the inference result of first network 211 and the inference result of second network 212 to calculate an evaluation value indicating the difference between the inference result of first network 211 and the inference result of second network 212. With reference to the calculated evaluation value, learning processor 204 trains first network 211 to reduce the difference between the inference result of first network 211 and the inference result of second network 212.
After that, when the performance of first network 211 fails to satisfy the third requirement (No in S610), learning processor 204 changes the number of nodes of first network 211 (S611). More specifically, when the inference accuracy of first network 211 is not higher than or equal to the reference, learning processor 204 increases the number of nodes of first network 211. With this, the inference accuracy of first network 211 is expected to be improved. Then, the processes of the first phase are repeated from the obtainment of the settings information for size reduction and the difficulty level information of inference (S602).
When the performance of first network 211 satisfies the third requirement (Yes in S610), size reductor 202 reduces the size of second network 212 on the basis of the settings for size reduction to generate third network 213 (S612).
Subsequently, learning processor 204 distills second network 212 to train third network 213 (S613). Note that this process may be omitted.
More specifically, evaluation value calculator 203 obtains the inference result of second network 212 and the inference result of third network 213 to calculate an evaluation value indicating the difference between the inference result of second network 212 and the inference result of third network 213. With reference to the calculated evaluation value, learning processor 204 trains third network 213 to reduce the difference between the inference result of second network 212 and the inference result of third network 213.
Next, learning processor 204 distills first network 211 to train third network 213 (S614).
More specifically, evaluation value calculator 203 obtains the inference result of first network 211 and the inference result of third network 213 to calculate an evaluation value indicating the difference between the inference result of first network 211 and the inference result of third network 213. With reference to the calculated evaluation value, learning processor 204 trains third network 213 to reduce the difference between the inference result of first network 211 and the inference result of third network 213.
After that, when the performance of third network 213 fails to satisfy a fourth requirement (No in S615), the processes of the first phase are repeated from the beginning (S601). When the performance of third network 213 satisfies the fourth requirement (Yes in S615), the processing ends. Through this, third network 213 whose performance satisfies the fourth requirement is obtained.
Each of the foregoing performances may be the inference accuracy, the inference speed, or a combination of these as with Embodiment 1 and Embodiment 2. Also, the foregoing requirements are requirements for these performances.
As described above, first network 211 and third network 213 in the present embodiment are based on the same parent. In particular, information processing system 200 in the present embodiment distills second network 212 to train first network 211. As such, a reference agreement rate is expected to improve compared to Embodiment 1 and Embodiment 2.
Note that information processing system 200 may include some of the elements described in the present embodiment and may perform some of the processes described in the present embodiment. Also, at least some of the elements and processes described in the present embodiment may be combined with at least some of the elements and processes described in another embodiment.
The following describes a basic implementation example and a basic operation example relating to Embodiment 1, Embodiment 2, and Embodiment 3.
Processor 301 is an information processing circuit that performs information processing. Processor 301 may serve as network searcher 201, size reductor 202, evaluation value calculator 203, learning processor 204, difficulty level calculator 205, and learning processor 206. Processor 301 may also serve as these units by reading a program from memory 302 and executing the program. Also, processor 301 may control the inference processing performed by first network 211, second network 212, and third network 213.
Memory 302 is a storage device for storing information, and can be also referred to as a recording medium. Memory 302 may store information such as size reduction settings 231, correct answer data 232, dataset 233, and task difficulty level 234. Memory 302 may also store a program used by processor 301 to execute information processing. Memory 302 may also store information of first network 211, second network 212, and third network 213.
Information processing system 200 is, for example, a computer. Information processing system 200 may be a single information processing device or may be configured by a plurality of information processing devices.
First, processor 301 obtains the first inference model serving as a reference (S701). Processor 301 then computes the second inference model that is larger than the first inference model in model size, on the basis of the first inference model (S702). Subsequently, processor 301 quantizes the computed second inference model to generate the third inference model (S703).
Next, processor 301 trains the third inference model, using machine learning (S704). Processor 301 then determines whether the performance of the trained third inference model satisfies a condition (S705). Subsequently, processor 301 outputs the trained third inference model when the performance satisfies the condition (S706).
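As a non-limiting illustration, steps S701 to S706 may be sketched as the function below. All callables passed in are hypothetical placeholders for the corresponding processing, and the retry loop for the case in which the condition is not satisfied is an assumption added for this sketch.

```python
def find_inference_model(first_model, compute_second, quantize, train,
                         performance, condition, max_rounds=10):
    """Sketch of S701 to S706 with hypothetical callables for each step."""
    second = None
    for _ in range(max_rounds):
        second = compute_second(first_model, second)  # S702: larger than the first model
        third = quantize(second)                      # S703: generate the third model
        third = train(third)                          # S704: train the third model
        if condition(performance(third)):             # S705: check the performance
            return third                              # S706: output the trained third model
    return None  # assumed behavior when no candidate satisfies the condition


# Toy usage: a model is a dict holding only a size.
first = {"size": 8}  # S701: the first inference model serving as a reference
result = find_inference_model(
    first,
    compute_second=lambda ref, prev: {"size": (prev or ref)["size"] * 2},
    quantize=lambda m: {"size": m["size"], "quantized": True},
    train=lambda m: m,
    performance=lambda m: min(1.0, m["size"] / 64),
    condition=lambda score: score >= 0.9,
)
```

In the toy usage, the candidate is doubled in size on each round until the quantized and trained model satisfies the condition.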
With this, the second inference model that is larger than the first inference model in model size is quantized. It is assumed that the performance of the second inference model having a large model size is less likely to decrease even after quantization. Stated differently, the third inference model that is generated by quantizing the second inference model that is larger than the first inference model in model size is assumed to suffer a relatively small loss caused by quantization. This makes it possible to find an inference model that is expected to reduce a loss caused by quantization.
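As a non-limiting illustration of what quantizing parameters and the resulting error may look like, the sketch below applies uniform symmetric quantization to a list of weights. This particular quantization scheme, the function names, and the sample weights are assumptions made for illustration; the quantization used in the present disclosure is not limited to this scheme.

```python
def quantize_weights(weights, bits):
    """Uniform symmetric quantization to signed `bits`-bit levels (bits >= 2),
    returned as de-quantized floats so the error can be inspected."""
    levels = 2 ** (bits - 1) - 1                    # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights) or 1.0   # avoid a zero scale
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def max_error(weights, bits):
    """Largest absolute difference between original and quantized weights."""
    quantized = quantize_weights(weights, bits)
    return max(abs(w - q) for w, q in zip(weights, quantized))


weights = [0.8, -0.31, 0.05, 0.47]
error_8bit = max_error(weights, 8)
error_4bit = max_error(weights, 4)
```

In this sketch, the error with 4 bits is larger than the error with 8 bits, which corresponds to a larger loss for a larger degree of quantization.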
Also, for example, processor 301 may obtain settings information indicating settings for quantization performed on the second inference model. Processor 301 may then set an initial value used for the computing of the second inference model, on the basis of the settings information and the first inference model.
With this, the computing of the second inference model is started on the basis of the settings information for quantization and the first inference model. This makes it possible to find, at an early stage, the third inference model that is based on quantization and the first inference model.
Also, for example, processor 301 may obtain difficulty level information indicating an inference difficulty level of at least one of the first inference model, the second inference model, or the third inference model. Processor 301 may then set an initial value for the computing of the second inference model, based on the difficulty level information and the first inference model.
With this, the computing of the second inference model is started on the basis of the difficulty level information of inference and the first inference model. This makes it possible to find, at an early stage, the third inference model that is based on the inference difficulty level and the first inference model.
Also, for example, the computing of the second inference model may be a search for the second inference model performed using a loss function. The loss function may be a function whose output value decreases with a decrease in a difference between an inference result of the first inference model and an inference result of the third inference model, and whose output value decreases with an increase in the model size of the second inference model relative to the first inference model. The search for the second inference model may then be performed to cause the output value of the loss function to decrease.
This makes it possible to find an inference model that is expected to reduce a loss caused by quantization, on the basis of the loss function.
Also, for example, processor 301 may obtain settings information indicating settings for the quantizing of the second inference model. Processor 301 may then change the loss function, based on the settings information.
This makes it possible to find an inference model that is expected to reduce a loss caused by quantization, on the basis of the loss function that is based on the settings for quantization.
Also, for example, the loss function may be changed to increase the output value of the loss function with an increase in a degree of the quantizing in the settings indicated by the settings information. The search for the second inference model may then be performed to cause the output value of the loss function to be less than or equal to a threshold.
With this, although the output value of the loss function increases with an increase in the degree of quantization, the second inference model is searched for to cause the output value of the loss function to be less than or equal to the threshold. Stated differently, even when the loss is large due to a large degree of quantization, the second inference model is searched for to satisfy a certain condition for reducing the loss. This makes it possible to find an inference model that is expected to reduce the loss to a certain level.
Also, for example, processor 301 may obtain difficulty level information indicating an inference difficulty level of at least one of the first inference model, the second inference model, or the third inference model. Processor 301 may then change the loss function, based on the difficulty level information.
This makes it possible to find an inference model that is expected to reduce a loss caused by quantization, on the basis of the loss function that is based on the inference difficulty level.
Also, for example, the loss function may be changed to increase the output value of the loss function with an increase in the inference difficulty level indicated by the difficulty level information. The search for the second inference model may then be performed to cause the output value of the loss function to be less than or equal to a threshold.
With this, although the output value of the loss function increases with an increase in the inference difficulty level, the second inference model is searched for to cause the output value of the loss function to be less than or equal to the threshold. Stated differently, even when the loss is large due to a high inference difficulty level, the second inference model is searched for to satisfy a certain condition for reducing the loss. This makes it possible to find an inference model that is expected to reduce the loss to a certain level.
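As a non-limiting illustration, the two changes to the loss function described above (for the degree of quantization and for the inference difficulty level) and the search against a threshold may be sketched as follows. The multiplicative form, the coefficients `a` and `b`, the numeric scales of the degree and the difficulty level, and the toy base loss are all assumptions made for illustration.

```python
def changed_loss(base_loss, quantization_degree, difficulty_level, a=0.5, b=0.5):
    """Increase the output value as the degree of quantization or the
    inference difficulty level increases (assumed multiplicative form)."""
    return base_loss * (1.0 + a * quantization_degree + b * difficulty_level)


def search_until_threshold(base_loss_for_size, start_size, quantization_degree,
                           difficulty_level, threshold, max_size=1024):
    """Search candidate sizes until the changed loss is at or below the threshold."""
    size = start_size
    while size <= max_size:
        loss = changed_loss(base_loss_for_size(size),
                            quantization_degree, difficulty_level)
        if loss <= threshold:
            return size, loss
        size *= 2  # assumed search step: try a larger second inference model
    return None


def toy_base_loss(size):
    # Toy assumption: a larger candidate yields a smaller base loss.
    return 1.0 / size


mild = search_until_threshold(toy_base_loss, 8, 0, 0, threshold=0.1)
strong = search_until_threshold(toy_base_loss, 8, 2, 0, threshold=0.1)
```

In this sketch, a larger degree of quantization raises the output value, so the search continues to a larger candidate before the same threshold is met.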
Also, for example, processor 301 may change settings for the quantizing of the second inference model when the performance fails to satisfy the condition.
With this, the settings for quantization may be changed so that the condition for the performance is satisfied. This makes it possible to find an inference model that satisfies the condition for the performance.
Also, for example, the condition may include accuracy or correctness of an inference of the third inference model with respect to an inference result of the first inference model or reference data. Processor 301 may then decrease a degree of the quantization when the accuracy or the correctness of the inference of the third inference model is less than or equal to a threshold.
With this, when the accuracy or the correctness of the inference of the third inference model is less than or equal to the threshold, the degree of quantization to be performed on the second inference model is decreased to improve the accuracy or the correctness of the inference of the third inference model. This makes it possible to find an inference model that is expected to satisfy the condition for the accuracy and correctness of inference.
Also, for example, the condition may include a speed of inference processing of the third inference model. Processor 301 may then increase a degree of the quantization when the speed of the inference processing is less than or equal to a threshold.
With this, when the speed of the inference processing of the third inference model is less than or equal to the threshold, the degree of quantization to be performed on the second inference model is increased to increase the speed of the inference processing of the third inference model. This makes it possible to find an inference model that is expected to satisfy the condition for the speed of the inference processing.
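As a non-limiting illustration, the two changes of the settings for quantization described above may be sketched as follows. Representing the degree of quantization by a bit width (fewer bits meaning a larger degree), the step of one bit, the bit-width limits, and checking the accuracy before the speed are assumptions made for illustration.

```python
def adjust_bit_width(bits, accuracy, speed, accuracy_threshold, speed_threshold,
                     min_bits=2, max_bits=16):
    """Return a new bit width for quantizing the second inference model."""
    if accuracy <= accuracy_threshold:
        # Decrease the degree of quantization: use more bits.
        return min(bits + 1, max_bits)
    if speed <= speed_threshold:
        # Increase the degree of quantization: use fewer bits.
        return max(bits - 1, min_bits)
    return bits  # both conditions are satisfied; keep the settings


more_bits = adjust_bit_width(8, accuracy=0.6, speed=100.0,
                             accuracy_threshold=0.7, speed_threshold=50.0)
fewer_bits = adjust_bit_width(8, accuracy=0.9, speed=40.0,
                              accuracy_threshold=0.7, speed_threshold=50.0)
unchanged = adjust_bit_width(8, accuracy=0.9, speed=100.0,
                             accuracy_threshold=0.7, speed_threshold=50.0)
```

Here, `speed` stands for any measure in which a larger value means faster inference processing, such as inferences per second.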
Also, for example, processor 301 may input data to the first inference model to obtain the inference result of the first inference model. Processor 301 may also input the data to the second inference model to obtain the inference result of the second inference model. Processor 301 may then train the first inference model on the basis of the difference between the inference result of the first inference model and the inference result of the second inference model.
With this, the first inference model and the third inference model are built, on the basis of the same second inference model. This thus reduces the difference between the inference result of the first inference model and the inference result of the third inference model.
Also, for example, processor 301 may further perform the processing shown in any one of the foregoing embodiments.
Each of these elements is an information processing circuit that performs information processing. These elements may be implemented as processor 301 shown in
Computing processor 401 is an element that corresponds to, for example, network searcher 201. Computing processor 401 performs processing that relates to the computing of the second inference model. More specifically, computing processor 401 performs the obtainment of the first inference model (S701) and the computing of the second inference model (S702) shown in
Generator 402 is an element that corresponds to, for example, size reductor 202. Generator 402 performs processing that relates to the quantization of the second inference model and the generation of the third inference model. More specifically, generator 402 performs the generation of the third inference model (S703) shown in
Trainer 403 is an element that corresponds to, for example, learning processor 204. Trainer 403 performs processing that relates to the training of the third inference model. More specifically, trainer 403 performs the training of the third inference model (S704) shown in
Determiner 404 is an element that corresponds to, for example, evaluation value calculator 203. Determiner 404 performs processing that relates to the determination of whether the performance of the third inference model satisfies the condition. More specifically, determiner 404 performs the determination (S705) shown in
Outputter 405 is an element that corresponds to, for example, evaluation value calculator 203. Outputter 405 performs processing that relates to the output of the third inference model. More specifically, outputter 405 performs the output of the third inference model (S706) shown in
Initial value setter 406 is an element that corresponds to, for example, network searcher 201. Initial value setter 406 performs processing that relates to the setting of the initial value for the computing of the second inference model. Loss function changer 407 is an element that corresponds to, for example, evaluation value calculator 203. Loss function changer 407 performs processing that relates to the change of the loss function. Quantization settings changer 408 is an element that corresponds to, for example, learning processor 204. Quantization settings changer 408 performs processing that relates to the change of the settings for quantization.
Conversely, network searcher 201 corresponds to, for example, computing processor 401 and initial value setter 406. Size reductor 202 corresponds to, for example, generator 402. Evaluation value calculator 203 corresponds to, for example, determiner 404, outputter 405, and loss function changer 407. Learning processor 204 and learning processor 206 correspond to, for example, trainer 403 and quantization settings changer 408.
Note that the configuration shown in
The aspects of the information processing system have been described above on the basis of the embodiments and so forth, but the aspects of the information processing system are not limited to such embodiments and so forth. Variations that can be conceived by those skilled in the art may be applied to the embodiments and so forth, and a plurality of elements in the embodiments and so forth may be freely combined. For example, processing performed by a specific element in the embodiments and so forth may be performed by another element instead of such specific element. Also, the processing order of a plurality of processes may be changed and a plurality of processes may be performed in parallel.
Also, each of the inference models is, for example, a mathematical model for performing inference processing, and may be any one of a machine learning model, a neural network model, or a deep learning model.
Also, the information processing method that includes the steps performed by the elements of the information processing system may be performed by a device or a system. Stated differently, the information processing method may be performed by the information processing system, or may be performed by another device or system.
For example, the foregoing information processing method may be performed by a computer that includes a processor, a memory, input and output circuits, and so forth. In so doing, the information processing method may be performed by the computer executing a program for causing the computer to perform the information processing method. Also, the program may be recorded in a non-transitory, computer-readable recording medium.
For example, such program causes the computer to execute the information processing method that includes: obtaining a first inference model serving as a reference; computing a second inference model that is larger than the first inference model in model size, based on the first inference model; quantizing the second inference model computed to generate a third inference model; training the third inference model, using machine learning; determining whether a performance of the third inference model trained satisfies a condition; and outputting the third inference model trained, when the performance satisfies the condition.
Also, a plurality of elements of the information processing system may be configured by a dedicated hardware product or a general-purpose hardware product for executing the foregoing program, etc., or a combination of these. The general-purpose hardware product may be configured by, for example, a memory that stores the program, and a general-purpose processor that reads the program from the memory and executes the program. Here, the memory may be, for example, a semiconductor memory or a hard disk, and the general-purpose processor may be, for example, a CPU.
Also, the dedicated hardware product may be configured by, for example, a memory and a dedicated processor. For example, the dedicated processor may refer to the memory to execute the foregoing information processing method.
Also, the elements of the information processing system may be electric circuits. These electric circuits may collectively form a single electric circuit or may be independent electric circuits. Also, these electric circuits may correspond to the dedicated hardware product, or may correspond to the general-purpose hardware product that executes the foregoing program, etc.
The present disclosure is applicable for use as an information processing system for finding an inference model that is expected to reduce a loss caused by quantization. Example applications of the present disclosure include a machine learning model building system, a neural network model building system, a deep learning model building system, etc.
Number | Date | Country | Kind |
---|---|---|---|
2021-033329 | Mar 2021 | JP | national |
This is a continuation application of PCT International Application No. PCT/JP2021/020527 filed on May 28, 2021, designating the United States of America, which is based on and claims priority of U.S. Provisional Patent Application No. 63/069,266 filed on Aug. 24, 2020, and Japanese Patent Application No. 2021-033329 filed on Mar. 3, 2021. The entire disclosures of the above-identified applications, including the specifications, drawings and claims are incorporated herein by reference in their entirety.
Number | Date | Country |
---|---|---|
63069266 | Aug 2020 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/JP2021/020527 | May 2021 | US |
Child | 18109340 | | US |