This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2015-244307, filed Dec. 15, 2015, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a server, a system and a search method.
In the fields of image and voice recognition, recognition performance has been gradually enhanced using machine learning techniques such as the support vector machine (SVM). Further, in recent years, multilayer neural networks have been employed, which has significantly enhanced recognition performance. Particular attention has been paid to the deep learning technique using a multilayer neural network, and the deep learning technique is now also applied to fields such as natural language analysis, in addition to image and voice recognition.
However, the deep learning technique requires a vast number of calculations for learning, and hence requires a lot of time. Further, in deep learning, many hyper-parameters (parameters that define the learning operation), such as the number of nodes in each layer, the number of layers, and the learning rate, are used. Furthermore, recognition performance varies greatly depending on the values of the hyper-parameters. Accordingly, it is necessary to search for a combination of hyper-parameters that provides the best recognition performance. In this search, a method is adopted in which learning is performed while changing the combination of hyper-parameters, and the combination realizing the best recognition performance is selected from the learning results obtained for the respective combinations.
In the above-mentioned deep learning, the conventional search method of selecting an optimal combination of hyper-parameters (i.e., one that provides good recognition performance) from a large number of parameters requires a lot of time, since the total number of parameter combinations is enormous.
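For illustration only, the following simplified sketch (using hypothetical parameter names and value ranges) shows how quickly the number of combinations grows; every additional hyper-parameter multiplies the number of learning runs that an exhaustive search would require.

```python
# Hypothetical search space used only for this illustration.
from itertools import product

search_space = {
    "num_layers": [2, 3, 4, 5],
    "nodes_per_layer": [64, 128, 256, 512, 1024],
    "learning_rate": [0.1, 0.03, 0.01, 0.003, 0.001],
    "batch_size": [32, 64, 128, 256],
}

combinations = list(product(*search_space.values()))
print(len(combinations))  # 4 * 5 * 5 * 4 = 400 learning runs for only four parameters
```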
A general architecture that implements the various features of the embodiments will now be described with reference to the drawings. The drawings and the associated descriptions are provided to illustrate the embodiments and not to limit the scope of the invention.
Various embodiments will be described hereinafter with reference to the accompanying drawings. In general, according to one embodiment, a server is configured to construct a neural network for performing deep learning and to search for parameters defining a learning operation, the server being included in a system together with a second server and a third server. The server is configured to: specify, from a search range of the parameters, a first combination of first initial parameters and a second combination of second initial parameters, using a search method based on a uniform distribution; transmit the first combination of first initial parameters to the second server; transmit the second combination of second initial parameters to the third server; receive, from the second server, a first learning result based on the first combination of first initial parameters; receive, from the third server, a second learning result based on the second combination of second initial parameters; specify, from the search range of the parameters, a third combination of third parameters, based on the first and second learning results and using a search method based on a probability distribution; transmit the third combination of third parameters to the second or third server; and receive, from the second or third server, a third learning result based on the third combination of third parameters.
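The operations summarized above may be pictured with the following simplified Python sketch. It is an illustration under assumed names only, not the claimed implementation: train_on() is a hypothetical stand-in for transmitting a combination to the second or third server and receiving its learning result, and propose_from_results() is a deliberately crude stand-in for a search method based on a probability distribution.

```python
import random

search_range = {"learning_rate": [0.1, 0.01, 0.001], "nodes_per_layer": [128, 256, 512]}

def sample_uniform(space):
    # First and second combinations: search method based on a uniform distribution.
    return {k: random.choice(v) for k, v in space.items()}

def train_on(worker, params):
    # Hypothetical stub: transmit the combination to the worker, wait for learning
    # to finish, and receive the learning result (here, a mock recognition ratio).
    return random.random()

def propose_from_results(space, history):
    # Crude stand-in for a probability-distribution-based search: reuse the better
    # past combination and perturb one of its values.
    best_params, _ = max(history, key=lambda h: h[1])
    candidate = dict(best_params)
    key = random.choice(list(space))
    candidate[key] = random.choice(space[key])
    return candidate

first, second = sample_uniform(search_range), sample_uniform(search_range)
result_1 = train_on("worker_2", first)     # second server
result_2 = train_on("worker_3", second)    # third server
third = propose_from_results(search_range, [(first, result_1), (second, result_2)])
result_3 = train_on("worker_2", third)     # second or third server
```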
Embodiments will be described hereinafter with reference to the accompanying drawings.
As shown in the drawings, the server system of the embodiment comprises a manager 11 and a plurality of workers 12-i.
The manager 11 is a server for managing hyper-parameter search processing, and comprises a hyper-parameter search range storage unit 111, a hyper-parameter candidate generator 112, and a task dispatching unit 113, as specifically shown in the drawings.
The random system is a search system based on a uniform distribution, and excels in searching discrete parameters and in searches that do not depend on an initial value. The Bayesian method is a type of gradient method and is a search method based on a probability distribution. It searches for an optimal solution in the vicinity of values obtained by past searches, and excels in searching continuous parameters. Regarding particulars of the Bayesian method, the following references disclose an open-source hyper-parameter search environment based on a Bayesian search, including processing for distributing tasks to a plurality of servers (a simplified sketch contrasting the two methods is given after the references):
Paper: Practical Bayesian Optimization of Machine Learning Algorithms (http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf)
Open-source environment: Spearmint (https://github.com/JasperSnoek/spearmint), latest commit 0544113 on Oct. 31, 2014
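To make the contrast between the two systems concrete, the following is a minimal, purely illustrative sketch (it is not the Spearmint implementation referenced above): a uniform draw suits a discrete parameter such as the number of layers, while drawing from a normal distribution centered on the best value found so far is the simplest caricature of searching in the vicinity of values obtained by past searches.

```python
import random

def random_candidate():
    # Search based on a uniform distribution: suited to discrete parameters and
    # independent of any initial value.
    return {"num_layers": random.choice([2, 3, 4, 5]),
            "learning_rate": 10 ** random.uniform(-4, -1)}

def distribution_candidate(best_so_far):
    # Search based on a probability distribution: propose a learning rate near the
    # best value obtained by past searches (a stand-in for the Bayesian method).
    lr = random.gauss(best_so_far["learning_rate"], 0.3 * best_so_far["learning_rate"])
    return {"num_layers": best_so_far["num_layers"], "learning_rate": max(lr, 1e-6)}

best = {"num_layers": 3, "learning_rate": 0.01}   # hypothetical best result so far
print(random_candidate(), distribution_candidate(best))
```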
The above-described task dispatching unit 113 distributes the learning processing for the respective candidates generated by the hyper-parameter candidate generator 112 to the workers 12-i as tasks, thereby instructing the workers to perform learning.
In turn, the workers 12-i receive candidate combinations of hyper-parameters from the manager 11, perform learning with the received candidates, and send the results of learning, such as a recognition ratio, an error rate, or cross-entropy, to the hyper-parameter candidate generator 112 of the manager 11.
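A worker's role may be sketched as follows; this is an illustration only, and build_and_train() is a hypothetical placeholder for constructing the neural network from the received combination and performing deep learning.

```python
import random

def build_and_train(params):
    # Hypothetical placeholder: a real worker would construct the neural network from
    # the received hyper-parameter combination, run deep learning, and measure these
    # indices on evaluation data.
    recognition_ratio = random.uniform(0.7, 0.99)
    return {"recognition_ratio": recognition_ratio,
            "error_rate": 1.0 - recognition_ratio,
            "cross_entropy": random.uniform(0.1, 1.0)}

def handle_task(params):
    # Receive a candidate combination from the manager 11, perform learning, and
    # return the learning result to the hyper-parameter candidate generator 112.
    return build_and_train(params)

print(handle_task({"learning_rate": 0.01, "nodes_per_layer": 256}))
```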
A description will now be given of processing of searching for hyper-parameter combinations.
In light of the above, the server system of the embodiment has a cluster structure comprising one server 11 called a manager and a plurality of servers 12-i called workers, thereby realizing an efficient and fast search for an optimal combination of hyper-parameters.
In contrast, if no other search remains at that point, subsequent hyper-parameter candidates that reflect the results of learning collected in the steps up to step S16 are generated (step S17). Since past search results are available for candidate generation at this time, the Bayesian method is adopted. The generated candidates are issued as tasks to arbitrary workers 12-i to instruct them to perform learning (step S18), and the end of the tasks is awaited (step S19). Upon receiving a response indicating the end of a task from a worker 12-i, the manager receives the result of learning from that worker (step S20). If another search remains, the processing returns to step S17 and the manager re-issues tasks (step S21). In contrast, if no further search remains, the processing is finished.
Considering that hyper-parameters of good performance may not be detected by the Bayesian method because of initial value dependency, a random search is performed first, and a subsequent search is performed using the Bayesian method. As a result, efficient searching that utilizes the advantages of the respective methods is realized.
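The two-phase procedure described above can be outlined by the following simplified sketch. It is illustrative only: run_task() is a hypothetical stub for issuing a task to a worker 12-i and receiving its learning result, and bayesian_like_candidate() merely perturbs the best combination found so far in place of a true Bayesian search.

```python
import random

search_range = {"learning_rate": [0.1, 0.03, 0.01, 0.003], "nodes_per_layer": [128, 256, 512]}

def run_task(params):
    # Hypothetical stub for issuing a task to an arbitrary worker 12-i and receiving
    # the result of learning (here, a mock recognition ratio).
    return random.random()

def uniform_candidate(space):
    # Random search phase: candidates drawn from a uniform distribution.
    return {k: random.choice(v) for k, v in space.items()}

def bayesian_like_candidate(space, history):
    # Subsequent phase: candidates reflecting past learning results (simplified).
    best, _ = max(history, key=lambda h: h[1])
    candidate = dict(best)
    key = random.choice(list(space))
    candidate[key] = random.choice(space[key])
    return candidate

history = []
for _ in range(4):                       # initial candidates by the random system
    params = uniform_candidate(search_range)
    history.append((params, run_task(params)))

for _ in range(8):                       # subsequent candidates by the Bayesian-style method
    params = bayesian_like_candidate(search_range, history)
    history.append((params, run_task(params)))

best_params, best_result = max(history, key=lambda h: h[1])
print(best_params, best_result)
```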
The above-mentioned procedure enables hyper-parameters for deep learning to be searched for efficiently.
A description will now be given of examples of the above-described embodiment for realizing further promotion of efficiency.
In hyper-parameter search for deep learning that utilizes a neural network, it is common practice to fix the structure of the neural network and perform the search while changing only the hyper-parameter values. However, it may be more efficient to also change the number of layers of the neural network during the search, instead of changing only the hyper-parameter values.
To search over the number of layers, the hyper-parameter candidate generator 112 of the manager 11 generates a parameter indicating the changed number of layers. If the number of nodes in a certain layer of the neural network is zero, that layer is considered not to exist: each worker 12-i performs learning assuming that the neural network does not include the layer, and transmits the result of learning to the manager 11. Thus, a search in which the number of layers is changed can be executed.
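One way to realize this convention is sketched below as an illustration only (it is not the code of the embodiment): a candidate lists a number of nodes for each potential layer, and layers whose node count is zero are dropped before the network is constructed, so a fixed-length candidate can encode a variable number of layers.

```python
def effective_layer_sizes(candidate_sizes):
    # A layer whose number of nodes is zero is treated as not existing.
    return [n for n in candidate_sizes if n > 0]

# Hypothetical candidate for a network with up to four hidden layers.
candidate = [512, 256, 0, 0]                 # the last two layers are absent
print(effective_layer_sizes(candidate))      # -> [512, 256], i.e. a two-hidden-layer network
```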
It is known that deep learning utilizing a neural network requires a long learning period, since performance is enhanced by repeatedly inputting the same data a few dozen times or more. In the case of a high-performance hyper-parameter, it is meaningful to enhance the performance by repeatedly inputting the same data a few dozen times. In the case of a low-performance hyper-parameter, however, the performance does not improve even if learning is repeated a few dozen times, so the time spent on this processing is wasted. In view of this, each worker 12-i monitors an index, such as the recognition ratio, during learning, interrupts learning when the hyper-parameter being used is determined to be low in performance, and transmits to the manager 11 the result of learning obtained at the time of the interruption. As described above, the index to be monitored during learning and transmitted to the manager 11 is, for example, a recognition ratio, an error rate, or cross-entropy.
A specific example is as follows.
For example, suppose that the number of repetitions of learning by each worker 12-i is 100, that learning is interrupted when the recognition ratio is 90% or less after 50 repetitions, and that learning is continued up to 100 repetitions when the recognition ratio exceeds 90% after 50 repetitions. If a high-performance hyper-parameter yields a recognition ratio of 93% at that point, learning is continued up to 100 repetitions. In contrast, if learning with a low-performance hyper-parameter yields a recognition ratio of 85% after 50 repetitions, the learning is interrupted at that point instead of being continued up to 100 repetitions, and an index indicating the result of learning at the time of the interruption is transmitted to the manager 11. This reduces wasted learning time and thereby enhances the efficiency of the entire processing.
In the above-mentioned example, the recognition ratio is compared against a threshold of 90%; however, another determination method may be employed. For instance, learning may be interrupted when the recognition ratio does not increase even after learning is repeated ten times, or when the slope of the learning curve becomes a predetermined value or less.
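A minimal sketch of the interruption rule is given below, assuming a hypothetical per-repetition training step; the 90% threshold checked after 50 of 100 repetitions follows the numerical example above, and the check could equally be replaced by one of the alternative criteria just mentioned.

```python
import random

MAX_REPETITIONS = 100
CHECK_AT = 50
THRESHOLD = 0.90

def train_one_repetition(state):
    # Hypothetical placeholder for one repetition of learning over the data; returns
    # the recognition ratio measured after that repetition.
    state["ratio"] = min(0.99, state["ratio"] + random.uniform(0.0, 0.01))
    return state["ratio"]

def learn_with_interruption():
    state = {"ratio": 0.5}
    for repetition in range(1, MAX_REPETITIONS + 1):
        recognition_ratio = train_one_repetition(state)
        if repetition == CHECK_AT and recognition_ratio <= THRESHOLD:
            # Low-performance hyper-parameter: interrupt and report the result so far.
            return {"interrupted": True, "repetitions": repetition,
                    "recognition_ratio": recognition_ratio}
    return {"interrupted": False, "repetitions": MAX_REPETITIONS,
            "recognition_ratio": recognition_ratio}

print(learn_with_interruption())
```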
By virtue of the above-described processing, in the case of a low-performance hyper-parameter, learning can be interrupted to omit wasted learning time, thereby enabling efficient hyper-parameter searching.
It is known that deep learning utilizing a neural network requires a long learning period. In order to shorten the learning period, the amount of learning data used by each worker 12-i during learning may be halved.
In deep learning utilizing a neural network, the initial values of the weights are generated at random, and the performance of learning varies slightly depending on these initial values. Because of this, each worker 12-i may perform learning a plurality of times while changing the initial weight values, and may transmit to the manager 11 an index indicating the average result of learning. This enables hyper-parameter searching to be performed stably.
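As an illustration only, the averaging may look like the following sketch, where train_with_seed() is a hypothetical placeholder for one learning run with a particular weight initialization.

```python
import random

def train_with_seed(params, seed):
    # Hypothetical placeholder: initialize the weights from `seed`, perform learning
    # with the given hyper-parameter combination, and return the recognition ratio.
    random.seed(seed)
    return random.uniform(0.85, 0.95)

def averaged_result(params, num_runs=3):
    # Repeat learning with different initial weight values and report the average,
    # which is the index transmitted to the manager 11.
    results = [train_with_seed(params, seed) for seed in range(num_runs)]
    return sum(results) / len(results)

print(averaged_result({"learning_rate": 0.01}))
```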
In deep learning utilizing a neural network, the initial weights are generated at random, and this randomness may cause slight differences in performance; the same performance may not be obtained even if learning is repeated using the same hyper-parameters. In light of this, each worker 12-i may store the model (the result of deep learning) of the highest performance and send it to the manager 11 along with the result of learning.
In deep learning utilizing a neural network, performance is enhanced by performing learning with the same data repeatedly input a few dozen times or more. In this case, however, an index of the learning result, such as recognition performance, may be degraded because of excessive learning when the number of repetitions exceeds a certain value. In light of this, each worker 12-i may monitor an index of the learning result, such as recognition performance, each time it performs learning with the data input once, and may store the model (the result of deep learning) of the highest performance.
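This per-epoch monitoring may be sketched as follows, again as an illustration only; train_one_epoch() is a hypothetical placeholder for learning with the data input once.

```python
import random

def train_one_epoch(model):
    # Hypothetical placeholder: one pass of learning over the input data; returns the
    # recognition performance measured after that pass.
    return random.uniform(0.80, 0.95)

def learn_keeping_best(num_epochs=40):
    model = {"weights": None}
    best_performance, best_model = float("-inf"), None
    for epoch in range(num_epochs):
        performance = train_one_epoch(model)
        if performance > best_performance:
            # Store the model of the highest performance observed so far, so that
            # degradation from excessive learning does not discard the best result.
            best_performance, best_model = performance, dict(model)
    return best_model, best_performance

print(learn_keeping_best())
```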
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.