MACHINE LEARNING DEVICE, MACHINE LEARNING METHOD, AND NON-TRANSITORY COMPUTER-READABLE RECORDING MEDIUM HAVING EMBODIED THEREON A MACHINE LEARNING PROGRAM

Information

  • Publication Number: 20230376763
  • Date Filed: July 10, 2023
  • Date Published: November 23, 2023
Abstract
A weight storage unit stores weights of a plurality of filters used to detect a feature of a task. A continual learning unit trains the weights of the plurality of filters in response to an input task in continual learning. A filter control unit compares, after a predetermined epoch number has been learned in continual learning, the weight of a filter that has learned the task with the weight of a filter that is learning the task, extracts overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks, and leaves one of the overlap filters as the shared filter and initializes the weights of filters other than the shared filter.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention

The present invention relates to machine learning technologies.


2. Description of the Related Art

Human beings can learn new knowledge through experiences over a long period of time and can maintain old knowledge without forgetting it. Meanwhile, the knowledge of a convolutional neural network (CNN) depends on the dataset used in learning. To adapt to a change in data distribution, it is necessary to re-train the CNN parameters on the entire dataset. In a CNN, the estimation precision for old tasks decreases as new tasks are learned. Thus, catastrophic forgetting cannot be avoided in a CNN. Namely, the result of learning old tasks is forgotten as new tasks are learned in successive learning.


Incremental learning or continual learning is proposed as a scheme to avoid catastrophic forgetting. One scheme for continual learning is PackNet.


Patent document 1 discloses a learning device configured to cause two or more learning modules to share model parameters updated by multiple learning modules.

  • [Patent Literature 1] JP2010-20446


SUMMARY OF THE INVENTION

The problem of catastrophic forgetting can be avoided in PackNet, which is one scheme for continual learning. In PackNet, however, the number of filters in a model is limited, and there is a problem in that the filters become saturated as new tasks are learned, so that the number of tasks that can be learned is limited.


The present disclosure addresses the issue, and a purpose thereof is to provide a machine learning technology capable of mitigating saturation of filters.


A machine learning device according to an aspect of the embodiment includes: a weight storage unit that stores weights of a plurality of filters used to detect a feature of a task; a continual learning unit that trains the weights of the plurality of filters in response to an input task in continual learning; and a filter control unit that, after a predetermined epoch number has been learned in continual learning, compares the weight of a filter that has learned the task with the weight of a filter that is learning the task and extracts overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks.


Another aspect of the embodiment relates to a machine learning method. The method includes: training weights of a plurality of filters used to detect a feature of a task in response to an input task in continual learning; and comparing, after a predetermined epoch number has been learned in continual learning, the weight of a filter that has learned the task with the weight of a filter that is learning the task and extracting overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks.


Optional combinations of the aforementioned constituting elements, and implementations of the embodiment in the form of methods, apparatuses, systems, recording mediums, and computer programs may also be practiced as additional modes of the present disclosure.





BRIEF DESCRIPTION OF THE DRAWINGS


FIGS. 1A-1E show continual learning, which is defined as a base technology;



FIG. 2 shows a configuration of a machine learning device according to the embodiment;



FIGS. 3A-3E show continual learning performed by the machine learning device of FIG. 2;



FIG. 4 shows an operation of the filter control unit of the machine learning device of FIG. 2; and



FIG. 5 is a flowchart showing a sequence of steps of continual learning performed by the machine learning device of FIG. 2.





DETAILED DESCRIPTION OF THE INVENTION

The invention will now be described by reference to the preferred embodiments. This description is not intended to limit the scope of the present invention but to exemplify the invention.



FIGS. 1A-1E show continual learning by PackNet, which is defined as a base technology. In PackNet, the weights of multiple filters in a model are trained in response to a given task. The figures show multiple filters in each layer of a convolutional neural network arranged in a lattice.


The learning process in PackNet proceeds in the following steps (A)-(E).

    • (A) The model learns task 1. FIG. 1A shows an initial state of the filters that have learned task 1. All filters have learned task 1 and are shown in black.
    • (B) The filters are arranged in descending order of weight value. The weights of 60% of all the filters are initialized in ascending order of weight value. FIG. 1B shows a final state of the filters that have learned task 1. The initialized filters are shown in white.
    • (C) Task 2 is then learned. In this step, the weight values of the black filters of FIG. 1B are locked. The weight values of only the white filters can be changed. FIG. 1C shows an initial state of the filters that have learned task 2. All filters shown in white in FIG. 1B have learned task 2 and are shown in hatched lines in FIG. 1C.
    • (D) As in step (B), the hatched filters that have learned task 2 are arranged in descending order of weight value. The weights of 60% of all the filters are initialized in ascending order of weight value. FIG. 1D shows a final state of the filters that have learned task 2. The initialized filters are shown in white.
    • (E) Further, task 3 is learned. In this step, the weight values of the black and hatched filters of FIG. 1D are locked. The weight values of only the white filters can be changed. FIG. 1E shows an initial state of the filters that have learned task 3. All filters shown in white in FIG. 1D have learned task 3 and are shown in horizontal stripes in FIG. 1E.


As learning continues in this way through task N in the learning process according to PackNet, the number of initialized white filters becomes progressively smaller, resulting in saturation. When the filters are saturated, it is no longer possible to learn a new task.
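To make the pace of saturation concrete, suppose, as a simplified reading of steps (B) and (D), that after each task roughly 60% of the filters assigned to that task are released and 40% stay locked. The pool of filters free for a new task then shrinks geometrically:

$$\text{free fraction after task } k \approx 0.6^{k}, \qquad 0.6^{3} \approx 0.22, \qquad 0.6^{5} \approx 0.08,$$

so after only a handful of tasks very few free filters remain. This is the saturation that the embodiment described below seeks to mitigate.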


Saturation of the PackNet filters cannot be avoided at some point in time. However, the speed at which the filters saturate can be mitigated. The embodiment addresses the issue by extracting, in the process of learning the current task, overlap filters having a high similarity in weight as shared filters shared by tasks. Of the overlap filters, one filter is left as a shared filter, and the weights of the filters other than the shared filter are initialized to 0. The filters whose weights are initialized are excluded from the training in response to the current task. This makes it possible to increase the number of filters available to learn a new task, mitigate the speed of saturation of the filters, and increase the number of tasks that can be learned.



FIG. 2 shows a configuration of a machine learning device 100 according to the embodiment. The machine learning device 100 includes an input unit 10, a continual learning unit 20, a filter processing unit 30, a filter control unit 40, a weight storage unit 50, an inference unit 60, and an output unit 70.


The input unit 10 supplies a supervised task to the continual learning unit 20 and supplies an unknown task to the inference unit 60. By way of one example, the task is image recognition. The task is set to recognize a particular object. For example, task 1 is recognition of a cat, task 2 is recognition of a dog, etc.


The weight storage unit 50 stores the weights of multiple filters used to detect a feature of the task. By running an image through multiple filters, the feature of the image can be captured.


The continual learning unit 20 continually trains the weights of the multiple filters in the weight storage unit 50 in response to the input supervised task and saves the updated filter weights in the weight storage unit 50.


After the continual learning unit 20 has learned the current task for a predetermined epoch number, the filter control unit 40 compares the weights of the filters learning the current task with the weights of the filters that have learned a past task and extracts overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks. The model is a multi-layer convolutional neural network, so the similarity between filter weights is calculated in each layer. Of the overlap filters, the filter control unit 40 leaves one filter as the shared filter, initializes the weights of the filters other than the shared filter, and saves the weights in the weight storage unit 50. An overlap filter whose weight is initialized is excluded from the training in response to the current task and is used to learn the next task.
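For illustration only, a minimal sketch of such an overlap check for one layer is shown below. The function name is hypothetical, the d1 distance (described later with FIG. 4) stands in for the similarity measure, and, following the description above, the current-task filter is the one that is initialized while the past-task filter is kept as the shared filter.

```python
import numpy as np

def extract_shared_filters(past, current, threshold):
    """Sketch of the overlap check for one layer (names are illustrative).

    past    : (P, k, k) weights of filters that have learned earlier tasks
    current : (C, k, k) weights of filters that are learning the current task
    Returns the current-task weights with overlapping filters initialized to 0
    and a boolean mask of the filters excluded from training on this task.
    """
    excluded = np.zeros(len(current), dtype=bool)
    for c, w_cur in enumerate(current):
        for w_past in past:
            # d1 distance: the smaller the distance, the more similar the filters.
            if np.abs(w_cur - w_past).sum() <= threshold:
                current[c] = 0.0        # keep the past filter as the shared filter
                excluded[c] = True      # exclude this filter from the current task
                break
    return current, excluded

# Usage on one layer of 3x3 filters: the near-duplicate of a past filter is dropped.
rng = np.random.default_rng(0)
past = rng.normal(size=(2, 3, 3))
current = np.concatenate([past[:1] + 0.01, rng.normal(size=(2, 3, 3))])
current, excluded = extract_shared_filters(past, current, threshold=0.5)
print(excluded)
```

In the sketch, the excluded filters keep a weight of 0 until the next task, at which point they rejoin the pool of trainable filters, mirroring the behavior described above.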


The predetermined epoch number is, for example, 10. It is desirable that the filter control unit 40 initialize similar filters after the learning has stabilized to a certain degree. The number of epochs or the duration of learning needed before the learning stabilizes varies from one task to another. It is therefore preferable to adjust the epoch number based on the relationship between loss and accuracy. Loss is defined as the error between an output value from the neural network and the correct answer given by the training data, and accuracy is defined as the accuracy rate of output values from the neural network.


Whether the learning has stabilized is determined by using one of the conditions below, and the predetermined epoch number is configured accordingly (a minimal sketch of this check follows the list).

    • (1) Loss is equal to or lower than a certain level (e.g., 0.75).
    • (2) Accuracy is equal to or greater than a certain level (e.g., 0.75).
    • (3) Both conditions (1) and (2) are met.
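The sketch below expresses this stabilization check; the function name, the mode switch, and the 0.75 defaults (taken from the examples above) are illustrative, not prescribed by the document.

```python
def learning_stabilized(loss, accuracy, loss_max=0.75, acc_min=0.75, mode="both"):
    """Return True when training is considered stable enough to compare filters.

    mode selects one of the conditions above:
      "loss" -> condition (1), "accuracy" -> condition (2), "both" -> condition (3).
    """
    if mode == "loss":
        return loss <= loss_max
    if mode == "accuracy":
        return accuracy >= acc_min
    return loss <= loss_max and accuracy >= acc_min

# Example: gate the overlap check on stability rather than on a fixed epoch count.
# if learning_stabilized(epoch_loss, epoch_accuracy):
#     ... run the filter comparison of the filter control unit 40 ...
```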


The filter processing unit 30 locks, of the multiple filters that have learned one task, the weights of a predetermined proportion of the filters to prevent them from being used to learn a further task, and initializes the weights of the rest of the filters so that they can be used to learn a further task. For example, the filters are arranged in descending order of filter weight. The weights of 40% of the filters are locked in descending order of weight value, and the weights of the remaining 60% of the filters are initialized so that they can be used to learn a further task.
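A minimal sketch of this step follows; scoring each filter by the sum of the absolute values of its weights is an assumption made for the sketch, since the document does not fix how the filters are ranked.

```python
import numpy as np

def prune_after_task(weights, keep_ratio=0.4):
    """Sketch of the post-task step above: lock the strongest filters, free the rest.

    weights : (F, k, k) filters that have learned the current task.
    Returns the weights with the freed filters set to 0 and a boolean mask of
    the locked filters.
    """
    score = np.abs(weights).sum(axis=(1, 2))      # one magnitude score per filter
    order = np.argsort(score)[::-1]               # descending order of weight
    locked = np.zeros(len(weights), dtype=bool)
    locked[order[: int(round(keep_ratio * len(weights)))]] = True   # top 40% locked
    weights[~locked] = 0.0                        # remaining 60% initialized
    return weights, locked
```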


The continual learning unit 20 continually trains the initialized weights of the filters in response to a new task.


The inference unit 60 uses the filter weight saved in the weight storage unit 50 to infer from an input unknown task. The output unit 70 outputs a result of inference by the inference unit 60.



FIGS. 3A-3E show continual learning performed by the machine learning device 100 of FIG. 2. Multiple filters in each layer of a convolutional neural network are shown arranged in a lattice, where (i,j) denotes the filter in the i-th row and the j-th column.


The learning process in the machine learning device 100 proceeds in the following steps (A)-(E).

    • (A) The model learns task 1. FIG. 3A shows an initial state of the filters that have learned task 1. All filters have learned task 1 and are shown in black.
    • (B) The filters are arranged in descending order of weight value. The weights of 60% of all the filters are initialized in ascending order of weight value. FIG. 3B shows a final state of the filters that have learned task 1. The initialized filters are shown in white.
    • (C) Task 2 is then learned. In this step, the weight values of the black filters of FIG. 3B are locked, and only the weight values of the white filters can be changed. If a filter used in task 2 is identified, in the process of learning task 2, as being similar to a filter (black) that has learned task 1, the filter control unit 40 initializes the identified filter and excludes it from the training in response to task 2. FIG. 3C shows an initial state of the filters that have learned task 2. Of the filters shown in white in FIG. 3B, those that have learned task 2 are shown in hatched lines in FIG. 3C, and those that are initialized in the process of learning task 2 and excluded from the training are shown in white in FIG. 3C. In this case, the (1,1) filter and the (1,5) filter are initialized in the process of learning task 2 and are available for use in subsequent new tasks.
    • (D) As in step (B), the hatched filters that have learned task 2 are arranged in descending order of weight value. The weights of 60% of all the filters are initialized in ascending order of weight value. FIG. 3D shows a final state of the filters that have learned task 2. The initialized filters are shown in white.
    • (E) Further, task 3 is learned. In this step, the weight values of the black and hatched filters of FIG. 3D are locked, and only the weight values of the white filters can be changed. If a filter used in task 3 is identified, in the process of learning task 3, as being similar to a filter (black) that has learned task 1 or a filter (hatched) that has learned task 2, the filter control unit 40 initializes the identified filter and excludes it from the training in response to task 3. FIG. 3E shows an initial state of the filters that have learned task 3. Of the filters shown in white in FIG. 3D, those that have learned task 3 are shown in horizontal stripes in FIG. 3E, and those that are initialized in the process of learning task 3 and excluded from the training are shown in white in FIG. 3E. In this case, the (1,1) filter, the (1,5) filter, and the (2,2) filter are initialized in the process of learning task 3 and are available for use in subsequent new tasks.


By subsequently executing a similar learning process through task N, it is possible to remove overlapping filters between tasks in the process of learning, to mitigate saturation of the filters, and to increase the number of tasks that can be learned.
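One way to realize the locking of the black and hatched filters in steps (C) and (E) is to zero the gradients of the locked filters during back propagation. The sketch below assumes PyTorch and per-output-channel locking; the layer size, the mask, and the dummy loss are placeholders, not part of the described device.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
locked = torch.zeros(8, dtype=torch.bool)
locked[:3] = True                       # assume the first 3 filters learned a past task

optimizer = torch.optim.SGD(conv.parameters(), lr=0.01)
x = torch.randn(4, 3, 32, 32)           # dummy batch
target = torch.randn(4, 8, 32, 32)      # dummy target

optimizer.zero_grad()
loss = nn.functional.mse_loss(conv(x), target)
loss.backward()
conv.weight.grad[locked] = 0.0          # locked filters receive no update
conv.bias.grad[locked] = 0.0
optimizer.step()                        # only the free (white) filters change
```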



FIG. 4 shows an operation of the filter control unit 40 of the machine learning device 100 of FIG. 2.


When the weights of the filters are being trained through back propagation, which is a supervised learning method for a neural network, the filter control unit 40 compares, once a predetermined epoch number has been learned, the weight of a filter for the task that is being learned with the weight of a filter for a task that has already been learned. When the weights are similar, the filter control unit 40 initializes the weight of the filter for the task that is being learned and excludes that filter from the training in response to the current task.


Since the model includes multiple layers, the comparison is made in each layer. For example, suppose one layer includes 128 filters. Given that 51 filters have learned task 1, 30 filters have learned task 2, and the remaining filters are initialized, the similarity between each of the 51 filters for task 1 and each of the 30 filters for task 2 is calculated.


A similarity is calculated by comparing the absolute values of the filter weight values. In the case of 3×3 filters, for example, the absolute values of the nine weights are compared. A threshold value is defined, and when the similarity exceeds the threshold value, the two filters are determined to overlap, and the weight of the filter for task 2 is initialized to 0. The filter for task 2 is then excluded from the subsequent training in response to task 2.


Given that each component of filter A is denoted by aij and each component of filter B is denoted by bij, the difference in absolute value between the values at the same positions in the two filters A and B is evaluated by the distances d1(A, B), d2(A, B), d(A, B), and dm(A, B) given below.











$$d_1(A, B) = \sum_{i=1}^{n} \sum_{j=1}^{n} \left| a_{ij} - b_{ij} \right|$$

$$d_2(A, B) = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \left( a_{ij} - b_{ij} \right)^{2}}$$

$$d(A, B) = \max_{1 \le i \le n} \, \max_{1 \le j \le n} \left| a_{ij} - b_{ij} \right|$$

$$d_m(A, B) = \max \left\{ \left\| (A - B)\,x \right\| : x \in \mathbb{R}^{n},\ \left\| x \right\| = 1 \right\}$$

In the above description, the similarity between filters is calculated from the differences in absolute value between the values at the same positions in the two filters. The similarity may be calculated by another method. For example, a filter sum of absolute differences may be defined for each filter as the sum of a horizontal sum of absolute differences SAD_H and a vertical sum of absolute differences SAD_V, i.e., SAD = SAD_H + SAD_V. When the difference between the filter sum of absolute differences SAD_A of filter A and the filter sum of absolute differences SAD_B of filter B is smaller than a threshold value, it may be determined that filter A and filter B overlap. Given that the components of a 3×3 filter in the first row are a1, a2, a3, the components in the second row are a4, a5, a6, and the components in the third row are a7, a8, a9, the horizontal sum of absolute differences SAD_H and the vertical sum of absolute differences SAD_V are given by the following expressions.





SAD_H=|a1−a2|+|a2−a3|+|a4−a5|+|a5−a6|+|a7−a8|+|a8−a9|





SAD_V=|a1−a4|+|a2−a5|+|a3−a6|+|a4−a7|+|a5−a8|+|a6−a9|


As an alternative method of calculating a similarity, a Euclidean distance or a cosine distance may be used for the comparison.
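For reference, the distances described above can be written compactly as follows. The function names are illustrative; d_m is realized here with the spectral matrix norm, which matches the definition of dm(A, B), and the final threshold comparison is left as a comment because the document leaves the threshold value open.

```python
import numpy as np

def d1(A, B):
    """Sum of absolute differences between same-position weights."""
    return np.abs(A - B).sum()

def d2(A, B):
    """Square root of the sum of squared differences (Euclidean distance)."""
    return np.sqrt(((A - B) ** 2).sum())

def d_max(A, B):
    """Largest absolute difference at any position (d(A, B) above)."""
    return np.abs(A - B).max()

def d_m(A, B):
    """Matrix norm of A - B: the maximum of |(A - B)x| over unit vectors x."""
    return np.linalg.norm(A - B, ord=2)

def sad(F):
    """Filter sum of absolute differences SAD = SAD_H + SAD_V for one filter."""
    sad_h = np.abs(np.diff(F, axis=1)).sum()   # horizontally adjacent components
    sad_v = np.abs(np.diff(F, axis=0)).sum()   # vertically adjacent components
    return sad_h + sad_v

# Two 3x3 filters are judged to overlap when the chosen distance (or the difference
# of their SAD values) falls below a threshold; the threshold itself is a design choice.
A = np.array([[0.2, 0.1, 0.0], [0.3, 0.5, 0.1], [0.0, 0.2, 0.4]])
B = A + 0.02
print(d1(A, B), d2(A, B), d_max(A, B), d_m(A, B), abs(sad(A) - sad(B)))
```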


When filters have a high similarity in weight, the filters are determined to have identical or hardly different characteristics across tasks, so there is no need to maintain an overlapping filter. Accordingly, one of such filters is initialized and used to learn a further task. A weight here refers to the value of one component in a filter; in the case of the 3×3 filter of FIG. 4, a weight is the value of one cell in the matrix. Alternatively, the weight may be defined in units of filters, i.e., in units of matrices.


More generally speaking, when a filter overlaps between task N, which has already been learned, and task N+1, which is being learned, the weight of the filter for task N+1 is initialized to 0 in order to maintain the performance on task N at the maximum level. This makes it possible to utilize the limited filter resources maximally.



FIG. 5 is a flowchart showing a sequence of steps of continual learning performed by the machine learning device 100 of FIG. 2.


The input unit 10 inputs a current supervised task to the continual learning unit 20 (S10).


The continual learning unit 20 continually trains the weights of the multiple filters in response to the current task until a predetermined epoch number is completed (S20).


The filter control unit 40 compares the filter learning the current task with the filter that has learned a past task and calculates a similarity in weight (S30).


The filter control unit 40 initializes any filter learning the current task that has a high similarity to a filter that has learned a past task (S40).


When the learning of the current task is completed (Y in S50), control proceeds to step S60. When the current task continues to be learned (N in S50), control is returned to step S20.


The filter processing unit 30 initializes a predetermined proportion of the multiple filters that have learned the current task in the ascending order of weight (S60).


When a task remains (N in S70), control is returned to step S10, and the next task is input. When there are no more tasks (Y in S70), continual learning is terminated.
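For orientation, the runnable sketch below strings steps S10 to S70 together on toy data. The epoch update is a stand-in for back propagation, and the filter count, threshold, and the fixed two comparison rounds per task are illustrative choices; only the 40%/60% split and the overall order of steps follow the description above.

```python
import numpy as np

def run(tasks, n_filters=8, k=3, threshold=0.3, epochs_per_check=3, seed=0):
    """Toy walk-through of FIG. 5 (S10-S70); the weight update is a placeholder."""
    rng = np.random.default_rng(seed)
    weights = np.zeros((n_filters, k, k))
    locked = np.zeros(n_filters, dtype=bool)                 # filters owned by past tasks
    for _task in tasks:                                       # S10: input the current task
        trainable = ~locked                                   # free filters learn this task
        for _round in range(2):                               # S50: repeat until task is done
            for _epoch in range(epochs_per_check):            # S20: predetermined epoch number
                weights[trainable] += 0.1 * rng.normal(size=(int(trainable.sum()), k, k))
            # S30/S40: a current-task filter similar to a past-task filter is
            # initialized and excluded from further training on this task.
            for c in np.flatnonzero(trainable):
                for p in np.flatnonzero(locked):
                    if np.abs(weights[c] - weights[p]).sum() <= threshold:
                        weights[c] = 0.0
                        trainable[c] = False
                        break
        # S60: lock the strongest 40% of the filters that learned this task, free the rest.
        current = np.flatnonzero(trainable)
        order = current[np.argsort(np.abs(weights[current]).sum(axis=(1, 2)))[::-1]]
        keep = order[: int(round(0.4 * len(current)))]
        weights[np.setdiff1d(current, keep)] = 0.0
        locked[keep] = True
    return weights, locked                                    # S70: no more tasks

weights, locked = run(["task1", "task2", "task3"])
print(int(locked.sum()), "filters locked after three tasks")
```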


The above-described various processes in the machine learning device 100 can of course be implemented by hardware-based devices such as a CPU and a memory and can also be implemented by firmware stored in a read-only memory (ROM), a flash memory, etc., or by software on a computer, etc. The firmware program or the software program may be made available on, for example, a computer readable recording medium. Alternatively, the program may be transmitted and received to and from a server via a wired or wireless network. Still alternatively, the program may be transmitted and received in the form of data broadcast over terrestrial or satellite digital broadcast systems.


As described above, the machine learning device 100 according to the embodiment makes it possible to mitigate the speed of saturation of filters in a continual learning model and to learn more tasks by using filters efficiently.


The present invention has been described above based on an embodiment. The embodiment is intended to be illustrative only and it will be understood by those skilled in the art that various modifications to combinations of constituting elements and processes are possible and that such modifications are also within the scope of the present invention.

Claims
  • 1. A machine learning device comprising: a weight storage unit that stores weights of a plurality of filters used to detect a feature of a task; a continual learning unit that trains the weights of the plurality of filters in response to an input task in continual learning; and a filter control unit that, after a predetermined epoch number has been learned in continual learning, compares the weight of a filter that has learned the task with the weight of a filter that is learning the task and extracts overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks.
  • 2. The machine learning device according to claim 1, wherein the filter control unit leaves one of the overlap filters as the shared filter and initializes the weights of filters other than the shared filter.
  • 3. The machine learning device according to claim 2, wherein the continual learning unit trains initialized weights of filters other than the shared filter in response to a further task in continual learning.
  • 4. The machine learning device according to claim 1, wherein the predetermined epoch number is determined based on a condition related to a change rate in loss defined as an error between an output value from a learning model and a correct answer given by training data or to a change rate in accuracy defined as an accuracy rate of an output value from a learning model.
  • 5. A machine learning method comprising: training weights of a plurality of filters used to detect a feature of a task in response to an input task in continual learning; and comparing, after a predetermined epoch number has been learned in continual learning, the weight of a filter that has learned the task with the weight of a filter that is learning the task and extracting overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks.
  • 6. A non-transitory computer-readable recording medium having embodied thereon a machine learning program comprising computer-implemented modules including: a module that trains weights of a plurality of filters used to detect a feature of a task in response to an input task in continual learning; and a module that compares, after a predetermined epoch number has been learned in continual learning, the weight of a filter that has learned the task with the weight of a filter that is learning the task and extracts overlap filters having a similarity in weight equal to or greater than a predetermined threshold value as shared filters shared by tasks.
Priority Claims (1)
  • Number: 2021-003241; Date: Jan 2021; Country: JP; Kind: national
CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation of application No. PCT/JP2021/045340, filed on Dec. 9, 2021, and claims the benefit of priority from the prior Japanese Patent Application No. 2021-003241, filed on Jan. 13, 2021, the entire content of which is incorporated herein by reference.

Continuations (1)
  • Parent: PCT/JP2021/045340; Date: Dec 2021; Country: US
  • Child: 18349195; Country: US