The present invention relates to machine learning technologies.
Human beings can learn new knowledge through experiences over a long period of time and can maintain old knowledge without forgetting it. Meanwhile, the knowledge of a convolutional neural network (CNN) depends on the dataset used in learning. To adapt to a change in data distribution, it is necessary to re-train the CNN parameters on the entirety of the dataset. In a CNN, the estimation precision for old tasks decreases as new tasks are learned. Thus, catastrophic forgetting cannot be avoided in a CNN; namely, the result of learning old tasks is forgotten as new tasks are learned in successive learning.
Incremental learning or continual learning is proposed as a scheme to avoid catastrophic forgetting. One scheme for continual learning is PackNet.
Patent document 1 discloses a learning device configured to cause two or more learning modules to share model parameters updated by multiple learning modules.
The problem of catastrophic forgetting can be avoided in PackNet, which is one scheme for continual learning. In PackNet, however, the number of filters in a model is limited, and there is a problem in that filters will be saturated as new tasks are learned so that the number of tasks that can be learned is limited.
The present disclosure addresses the issue, and a purpose thereof is to provide a machine learning technology capable of mitigating saturation of filters.
A machine learning device according to an aspect of the embodiment includes: a weight storage unit that stores weights of a plurality of filters used to detect a feature of a task; a continual learning unit that trains the weights of the plurality of filters in response to an input task in continual learning; and a filter control unit that, after a predetermined epoch number has been learned in continual learning, compares the weight of a filter that has learned the task with the weight of a filter that is learning the task and extracts overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks.
Another aspect of the embodiment relates to a machine learning method. The method includes: training weights of a plurality of filters used to detect a feature of a task in response to an input task in continual learning; and comparing, after a predetermined epoch number has been learned in continual learning, the weight of a filter that has learned the task with the weight of a filter that is learning the task and extracting overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks.
Optional combinations of the aforementioned constituting elements, and implementations of the embodiment in the form of methods, apparatuses, systems, recording mediums, and computer programs may also be practiced as additional modes of the present disclosure.
The invention will now be described by reference to the preferred embodiments. This does not intend to limit the scope of the present invention, but to exemplify the invention.
The learning process in PackNet proceeds in the following steps (A)-(E).
As learning continues through task N in the learning process according to PackNet in this way, the number of initialized, unused (white) filters becomes smaller and smaller, resulting in saturation. When the filters are saturated, it is no longer possible to learn a new task.
Saturation of the PackNet filters at some point in time cannot be avoided. However, the speed at which the filters saturate can be mitigated. The embodiment addresses the issue by extracting, in the process of learning a current task, overlap filters having a high similarity in weight as shared filters shared by tasks. Of the overlap filters, one filter is left as a shared filter, and the weights of the filters other than the shared filter are initialized to 0. The filters for which the weights are initialized are excluded from the training in response to the current task. This makes it possible to increase the number of filters that can learn a new task, mitigate the speed of saturation of the filters, and increase the number of tasks that can be learned.
The input unit 10 supplies a supervised task to the continual learning unit 20 and supplies an unknown task to the inference unit 60. By way of one example, the task is image recognition. The task is set to recognize a particular object. For example, task 1 is recognition of a cat, task 2 is recognition of a dog, etc.
The weight storage unit 50 stores the weights of multiple filters used to detect a feature of the task. By running an image through multiple filters, the feature of the image can be captured.
The continual learning unit 20 continually trains the weights of the multiple filters in the weight storage unit 50 in response to the input supervised task and saves the updated filter weights in the weight storage unit 50.
After the continual learning unit 20 learns the current task to complete a predetermined epoch number, the filter control unit 40 compares the weights of multiple filters learning the current task with the weights of multiple filters that have learned a past task and extracts overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks. The model is a multi-layer convolutional neural network, so a similarity between the weights of multiple filters is calculated in each layer. Of the overlap filters, the filter control unit 40 leaves one filter as a shared filter, initializes the weights of the filters other than the shared filter, and saves the weights in the weight storage unit 50. The overlap filter for which the weight is initialized is excluded from the training in response to the current task and is used to learn the next task.
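By way of illustration only, the layer-wise comparison performed by the filter control unit 40 can be sketched as follows. This is a minimal sketch assuming PyTorch-style weight tensors; the function name extract_shared_filters, the index lists, and the similarity_fn callable are hypothetical and not part of the disclosure.

```python
import torch

def extract_shared_filters(layer_weights, past_idx, current_idx,
                           threshold, similarity_fn):
    """Layer-wise extraction of shared filters.

    layer_weights: tensor of shape (num_filters, in_channels, k, k)
    past_idx:      indices of filters that have learned a past task
    current_idx:   indices of filters that are learning the current task
    similarity_fn: callable returning a similarity for two filter tensors
    Returns the indices of current-task filters whose weights were reset to 0.
    """
    initialized = []
    for c in current_idx:
        for p in past_idx:
            if similarity_fn(layer_weights[p], layer_weights[c]) >= threshold:
                # The past-task filter is kept as the shared filter; the
                # overlapping current-task filter is initialized to 0 and
                # excluded from further training on the current task.
                with torch.no_grad():
                    layer_weights[c].zero_()
                initialized.append(c)
                break
    return initialized
```

The similarity function here could be any of the measures described later, such as the absolute-difference distances, the sum-of-absolute-difference comparison, or a Euclidean or cosine distance.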
The predetermined epoch number is, for example, 10. It is desirable that the filter control unit 40 initializes similar filters after the learning is stabilized to a certain degree. The number of times or duration of learning before the learning is stabilized varies from one task to another. It is therefore preferable to adjust the epoch number based on a relationship between loss and accuracy. Loss is defined as an error between an output value from the neural network and a correct answer given by the training data, and accuracy is defined as an accuracy rate of an output value from the neural network.
That the learning is stabilized is determined by using one of the conditions below, and the predetermined epoch number is configured accordingly.
The filter processing unit 30 locks, of the multiple filters that have learned one task, the weights of a predetermined proportion of the filters to prevent them from being used to learn a further task and initializes the weights of the rest of the filters to use them to learn a further task. For example, the filters are arranged in the descending order of filter weight. The weights of 40% of the filters are locked in the descending order of weight value, and the weights of the remaining 60% of the filters are initialized to use them to learn a further task.
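A minimal sketch of this lock-and-release step, again assuming PyTorch-style weight tensors, is given below; lock_and_release and its parameters are illustrative names, and ranking the filters by the summed absolute value of their weights is an assumption.

```python
import torch

def lock_and_release(layer_weights, task_idx, lock_ratio=0.4):
    """Lock the strongest filters of a learned task and release the rest.

    layer_weights: tensor of shape (num_filters, in_channels, k, k)
    task_idx:      indices of the filters that learned the current task
    lock_ratio:    proportion of filters to lock, e.g. 0.4 locks the top 40%
    Returns (locked_indices, released_indices).
    """
    # Rank the task's filters by the magnitude of their weights.
    magnitudes = torch.stack([layer_weights[i].abs().sum() for i in task_idx])
    order = torch.argsort(magnitudes, descending=True)
    n_locked = int(len(task_idx) * lock_ratio)

    locked = [task_idx[int(i)] for i in order[:n_locked]]
    released = [task_idx[int(i)] for i in order[n_locked:]]
    with torch.no_grad():
        for i in released:
            layer_weights[i].zero_()   # re-initialized; reused for the next task
    return locked, released
```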
The continual learning unit 20 continually trains the initialized weights of the filters in response to a new task.
The inference unit 60 uses the filter weight saved in the weight storage unit 50 to infer from an input unknown task. The output unit 70 outputs a result of inference by the inference unit 60.
The learning process in the machine learning device 100 proceeds in the following steps (A)-(E).
By subsequently executing a similar learning process through task N, it is possible to remove overlap between filters across tasks in the process of learning, to mitigate saturation of the filters, and to increase the number of tasks that can be learned.
When the weights of filters are being trained through back propagation, which is a supervised learning method for a neural network, the filter control unit 40 compares the weight of a filter for the task that is being learned, after a predetermined epoch number has been completed, with the weight of a filter for a task that has already been learned. When the weights are similar, the filter control unit 40 initializes the weight of the filter for the task that is being learned and excludes that filter from the training in response to the current task.
Since the model includes multiple layers, the comparison is made in each layer. For example, one layer includes 128 filters. Given that there are 51 filters that have learned task 1 and 30 filters that are learning task 2, and the remaining filters are initialized, similarities between the 51 filters for task 1 and the 30 filters for task 2 are calculated.
A similarity is calculated by comparing the absolute values of the filter weight values. In the case of 3×3 filters, for example, the absolute values of the nine weights are compared. A threshold value is defined. When the similarity exceeds the threshold value, it is determined that two filters overlap, and the weight of the filter for task 2 is initialized to 0. The filter for task 2 is excluded from the subsequent training in response to task 2.
Given that each component of filter A is defined by aij and each component of filter B is defined by bij, a difference in absolute value between the values at the same position in the two filters A, B is calculated as given by d1 (A, B), d2(A, B), d∞(A, B), and dm(A,B).
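The concrete expressions for these distance measures are not reproduced in this excerpt. A plausible reading for a 3×3 filter, consistent with the element-wise absolute differences described above, is the following; the exact forms may differ, and dm in particular is assumed here to be the mean absolute difference. Since these are distances, a smaller value indicates a higher similarity in weight.

$$d_1(A,B)=\sum_{i,j}\lvert a_{ij}-b_{ij}\rvert,\qquad d_2(A,B)=\sqrt{\sum_{i,j}\left(a_{ij}-b_{ij}\right)^2},$$
$$d_\infty(A,B)=\max_{i,j}\lvert a_{ij}-b_{ij}\rvert,\qquad d_m(A,B)=\frac{1}{9}\sum_{i,j}\lvert a_{ij}-b_{ij}\rvert.$$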
In the above description, a similarity between filters is calculated by calculating a difference in absolute value between the values at the same position in the two filters. A similarity may be calculated by a method other than this. For example, a filter sum of absolute difference is defined for each filter as a sum of a horizontal sum of absolute difference SAD_H and a vertical sum of absolute difference SAD_V such that SAD=SAD_H+SAD_V. When a difference between the filter sum of absolute difference SAD_A of filter A and the filter sum of absolute difference SAD_B of filter B is smaller than a threshold value, it may be determined that filter A and filter B overlap. Given here that the components of a 3×3 filter in the first row are a1, a2, a3, the components in the second row are a4, a5, a6, and the components in the third row are a7, a8, a9, the horizontal sum of absolute difference SAD_H and the vertical sum of absolute difference SAD_V are given by the following expressions.
SAD_H=|a1−a2|+|a2−a3|+|a4−a5|+|a5−a6|+|a7−a8|+|a8−a9|
SAD_V=|a1−a4|+|a2−a5|+|a3−a6|+|a4−a7|+|a5−a8|+|a6−a9|
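As a minimal sketch following the expressions above, the SAD-based overlap check could be implemented as follows; the function names and the threshold variable are illustrative, not taken from the embodiment.

```python
def filter_sad(f):
    """Filter sum of absolute differences SAD = SAD_H + SAD_V for a 3x3 filter.

    f is a flat sequence (a1, ..., a9) read row by row, as in the expressions above.
    """
    a1, a2, a3, a4, a5, a6, a7, a8, a9 = f
    sad_h = (abs(a1 - a2) + abs(a2 - a3) + abs(a4 - a5) + abs(a5 - a6)
             + abs(a7 - a8) + abs(a8 - a9))
    sad_v = (abs(a1 - a4) + abs(a2 - a5) + abs(a3 - a6) + abs(a4 - a7)
             + abs(a5 - a8) + abs(a6 - a9))
    return sad_h + sad_v

def overlaps_by_sad(filter_a, filter_b, threshold):
    """Filters A and B are judged to overlap when their filter SADs are close."""
    return abs(filter_sad(filter_a) - filter_sad(filter_b)) < threshold
```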
As an alternative method of calculating a similarity, comparison based on a Euclidean distance or a cosine distance may be used.
When filters have a high similarity in weight, the filters are determined to have identical or hardly different characteristics across tasks, so that there is no need to maintain a filter that overlaps. Accordingly, one of such filters is initialized and used to learn a further task. The weight is defined as that of one component in a filter; in the case of a 3×3 filter, for example, there are nine such weights.
More generally speaking, when there is a filter that overlaps across task N that has been learned and task N+1 that is being learned, the weight of the filter for task N+1 that is being learned is initialized to 0 in order to maintain the performance at task N at the maximum level. This makes it possible to utilize limited filter resources maximally.
The input unit 10 inputs a current supervised task to the continual learning unit 20 (S10).
The continual learning unit 20 continually trains the weights of multiple filters in response to a current task to complete a predetermined epoch number (S20).
The filter control unit 40 compares the filter learning the current task with the filter that has learned a past task and calculates a similarity in weight (S30).
The filter control unit 40 initializes the filter learning the current task having a high similarity with the filter that has learned the past task (S40).
When the learning of the current task is completed (Y in S50), control proceeds to step S60. When the current task continues to be learned (N in S50), control is returned to step S20.
The filter processing unit 30 initializes a predetermined proportion of the multiple filters that have learned the current task in the ascending order of weight (S60).
When a task remains, control is returned to step S10, and the next task is input (N in S70). When there are no more tasks, continual learning is terminated (Y in S70).
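Purely as an illustration, the flow of steps S10 through S70 can be outlined as follows. The model, the tasks, and the helper callables are placeholders and not part of the disclosed embodiment; task-specific behavior is passed in as functions so that only the control flow is shown.

```python
def continual_learning_loop(tasks, model, train_fn, is_finished_fn,
                            extract_shared_fn, lock_fn, epoch_number=10):
    """Illustrative outline of steps S10 through S70."""
    for task in tasks:                                # S10: input the current task
        while True:
            train_fn(model, task, epoch_number)       # S20: train the filter weights
            extract_shared_fn(model, task)            # S30-S40: compare with filters of
                                                      # past tasks, zero overlapping ones
            if is_finished_fn(model, task):           # S50: learning of current task done?
                break
        lock_fn(model, task)                          # S60: lock a proportion of the
                                                      # filters, initialize the rest
    return model                                      # S70: no tasks remain
```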
The above-described various processes in the machine learning device 100 can of course be implemented by hardware-based devices such as a CPU and a memory and can also be implemented by firmware stored in a read-only memory (ROM), a flash memory, etc., or by software on a computer, etc. The firmware program or the software program may be made available on, for example, a computer readable recording medium. Alternatively, the program may be transmitted and received to and from a server via a wired or wireless network. Still alternatively, the program may be transmitted and received in the form of data broadcast over terrestrial or satellite digital broadcast systems.
As described above, the machine learning device 100 according to the embodiment makes it possible to mitigate the speed of saturation of filters in a continual learning model and to learn more tasks by using filters efficiently.
The present invention has been described above based on an embodiment. The embodiment is intended to be illustrative only and it will be understood by those skilled in the art that various modifications to combinations of constituting elements and processes are possible and that such modifications are also within the scope of the present invention.
This application is a continuation of application No. PCT/JP2021/045340, filed on Dec. 9, 2021, and claims the benefit of priority from the prior Japanese Patent Application No. 2021-003241, filed on Jan. 13, 2021, the entire content of which is incorporated herein by reference.
Related Application Data: parent application PCT/JP2021/045340, filed December 2021 (US); child application No. 18349195 (US).