This application claims priority to and the benefit of Netherlands Patent Application No. 2032650, titled “Method and System for Multi-Task Structural Learning”, filed on Aug. 1, 2022, and the specification and claims thereof are incorporated herein by reference.
The invention relates to a computer-implemented method and a system for multi-task structural learning in an artificial neural network wherein the architecture and its parameters are learned simultaneously.
Artificial Neural Networks (ANNs) have exhibited strong performance in various tasks essential for scene understanding. Single-Task Learning (STL) [2, 3, 4] has largely been at the center of this progress, driven by custom task-specific improvements. Despite these improvements, using single-task networks for the multiple tasks required for scene understanding comes with notable problems, such as a linear increase in computational cost and a lack of inter-task communication.
Multi-Task Learning (MTL), on the other hand, with the aid of shared layers, provides benefits over STL such as improved inference efficiency and positive information transfer between tasks. However, a notable drawback of sharing layers is task interference. Existing works have attempted to alleviate task interference using architecture modifications [5, 6], by determining which tasks to group together using a notion of similarity [7, 8, 9], by balancing task loss functions [10, 11, 12, 13], or by learning the architecture [14, 15]. Although these methods have shown promise, progress can be made by drawing inspiration from the brain, which is the only known intelligent system that excels in multi-task learning.
Task Interference in MTL:
Although different lines of work such as architecture modifications [5, 6], task grouping [7, 8, 9], or task loss balancing [10, 11, 12, 13] address task interference, structural learning has not been widely studied. Learning in the brain, in addition to changes in synaptic strength, also involves structural changes. Instead of using static architectures, Guo et al. [14] and Lu et al. [15] propose methods to learn the multi-task architecture. Guo et al. [14] start from a dense search space in which a child layer is connected to a plurality of parent layers. During learning, a distribution over parent nodes is learned with the aid of path sampling. At the end of training, a valid network path is picked and, using neuron removal, the neurons that are no longer part of the valid path are removed. However, the method of Guo et al. [14] does not involve progressive neuron removals at different intervals during training. Lu et al. [15] use neuron creation, splitting tasks into different branches from the output layer toward the input layer using inter-task affinities defined on task error margins. Contrary to Lu et al. [15], moving from a dense set of neurons to a sparse architecture is likely more similar to structural learning in the brain.
This application refers to various references. Discussion of such references is given for more complete background and is not to be construed as an admission that such references are prior art for patentability determination purposes.
It is an object of the current invention to correct the shortcomings of the prior art and to provide a solution for efficient multi-task structural learning in artificial neural networks. This and other objects, which will become apparent from the following disclosure, are provided with a computer-implemented method for learning of a plurality of tasks in artificial neural networks, a computer-readable storage, and an autonomous vehicle, having the features of one or more of the appended claims.
In a first aspect of the invention, the computer-implemented method for learning of a plurality of tasks in artificial neural networks comprises the steps of:
Additionally, the method comprises a task learning phase comprising the steps of:
Advantageously, the step of maximizing similarity among task nodes by aligning learned concepts of said task nodes comprises the step of locally increasing a similarity in said learned concepts by gauging a similarity between features of said task nodes, representing the local activity of a task, using a similarity metric such as Centered Kernel Alignment.
In fact, the method of the invention utilizes two neural operators, namely neuron creation and neuron removal, to aid structural learning. In early development, the brain has excess neurons that can provide a rich information pipeline, enabling neural circuits to undergo pruning and to functionally specialize. Likewise, the method of the invention creates excess neurons by starting from a disparate network for each task. As training progresses, corresponding task neurons in a layer pave the way for a specialized group neuron, leading to a structural change.
Suitably, the task learning phase comprises the steps of:
More suitably, the step of training the entire network to minimize a multi-task loss comprises the step of training the task branch, including the task node, to minimize only the corresponding task loss, independently of the other tasks.
Furthermore, the step of creating neurons based on local task similarity comprises the steps of:
Additionally, the step of creating neurons based on local task similarity comprises the step of using knowledge learned in the task nodes for initializing the created group node using a two-step process, comprising:
The step of removing neurons based on local task similarity comprises the steps of:
Advantageously, the method comprises a fine-tuning phase wherein the network is only trained with the multi-task loss while skipping the step of aligning concepts learned by task nodes.
More advantageously, the method comprises the step of alternating between the task learning phase and the structural learning phase for a plurality of times before starting the final fine-tuning phase.
In a second embodiment of the invention, the computer-readable storage is provided with a computer program wherein, when said computer program is loaded and executed by a computer, said computer program causes the computer to carry out the steps of the computer-implemented method according to any one of the aforementioned embodiments.
In a third embodiment of the invention, an autonomous vehicle comprises a computer loaded with a computer program wherein said program is arranged to cause the computer to carry out the steps of the computer-implemented method according to any one of the aforementioned embodiments.
Objects, advantages and novel features, and further scope of applicability of the present invention will be set forth in part in the detailed description to follow, taken in conjunction with the accompanying drawings, and in part will become apparent to those skilled in the art upon examination of the following, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The accompanying drawings, which are incorporated into and form a part of the specification, illustrate one or more embodiments of the present invention and, together with the description, serve to explain the principles of the invention. The drawings are only for the purpose of illustrating one or more embodiments of the invention and are not to be construed as limiting the invention. In the drawings:
Whenever in the FIGURES the same reference numerals are applied, these numerals refer to the same parts.
A method according to an embodiment of the present invention utilizes two neural operators, namely neuron creation and neuron removal, to aid structural learning. In early development, the brain has excess neurons that can provide a rich information pipeline, enabling neural circuits to undergo pruning and to functionally specialize. Likewise, the method of the invention creates excess neurons by starting from a disparate network for each task. As training progresses, corresponding task neurons in a layer pave the way for a specialized group neuron, leading to a structural change.
Enabling Neuron Creation and Neuron Removal:
A method according to an embodiment of the present invention relies on local task similarity to drive group neuron creation and the removal of the corresponding task neurons. The learned convolutional filters in different task branches might not align one-to-one due to the permutation invariance of convolutional neural networks [16]. Existing works [17, 18, 19, 20] use different ways to align corresponding layers of two models to counteract this permutation invariance. The method of the invention, referred to herein as Multi-Task Structural Learning (MTSL), uses Centered Kernel Alignment (CKA) [21] to align neurons based on representation similarity. Knowledge amalgamation approaches [17, 22, 23, 24, 25] address distilling the knowledge from multiple learned teachers into a single student. Ye et al. [26] create task-specific coding at a layer in the student network using a small network for feature distillation. MTSL uses this feature distillation process to exploit the knowledge of the task neurons that are set to be removed.
In structural learning, the multi-task architecture and its parameters are learned simultaneously. Given a set of T tasks, each with its own single-task network of L layers, the computer-implemented method according to the current invention results in a single multi-task network capable of inferring all T tasks accurately without any need for retraining.
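By way of non-limiting illustration only, the problem setup described above may be sketched in Python (PyTorch-style) as T disparate single-task networks of L layers each, followed by a task-specific head; all names in the sketch (make_layer, SingleTaskNet, and the chosen layer types and sizes) are hypothetical and form no part of the claims.

import torch.nn as nn

def make_layer(in_ch: int, out_ch: int) -> nn.Module:
    # A generic convolutional layer standing in for one of the L layers.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

class SingleTaskNet(nn.Module):
    def __init__(self, num_layers: int, channels: int, head: nn.Module):
        super().__init__()
        # The first layer acts as the initial task node; the remaining layers,
        # excluding the head, form the task branch.
        self.layers = nn.ModuleList(
            [make_layer(3 if i == 0 else channels, channels) for i in range(num_layers)]
        )
        self.head = head

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.head(x)

# One disparate network per task; the method fuses them into a single
# multi-task network over the course of training.
T, L = 3, 4
single_task_nets = [SingleTaskNet(L, 64, nn.Conv2d(64, 1, 1)) for _ in range(T)]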
Definition of terminology: a node is a layer that connects one branch to another branch (or to a node), and a branch is a sequence of layers that follows a node.
Initially, the first layer of each single task network is the task node while the rest of the task network, excluding the task head, is called the task branch. Similarly, a group of tasks will have a group node and a group branch. A task node is of particular significance to the method of the current invention as tasks can only be fused at the task node. Also, only the task nodes which are connected to the same group branch or to the same group node can be fused. At the start of the training, all task nodes are connected to the input image and can be fused. The schematic in
Aligning Task Specific Representations:
As is evident from the schematic of
CKA is used to measure the similarity between two feature representations and has been shown to provide meaningful similarity scores. During training, a CKA-based regularization term is introduced between task nodes branching from the same group node/branch (or the input). This regularization term, as shown in the alignment part of the schematic in
The overall loss that is used for training in the task learning phase of the computer-implemented method according to the current invention comprises two terms. The first term represents the multi-task loss which is a weighted sum of all individual task losses. The second term is the CKA regularization term which is included with a balancing factor lambda and with a negative sign to maximize alignment between tasks.
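A minimal sketch of this loss, assuming linear CKA computed on flattened task-node activations, is given below; the function and variable names (linear_cka, task_learning_loss, task_weights, lam) are illustrative assumptions rather than claim language.

import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # x, y: (batch, features) activations of two task nodes, flattened per sample.
    x = x - x.mean(dim=0, keepdim=True)  # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.norm(y.t() @ x, p="fro") ** 2
    return cross / (torch.norm(x.t() @ x, p="fro") * torch.norm(y.t() @ y, p="fro") + 1e-8)

def task_learning_loss(task_losses, task_node_feats, task_weights, lam):
    # First term: weighted sum of the individual task losses.
    mtl_loss = sum(w * l for w, l in zip(task_weights, task_losses))
    # Second term: CKA alignment between all pairs of fusible task nodes,
    # subtracted with the balancing factor lambda so that maximizing
    # similarity minimizes the overall loss.
    cka_term = 0.0
    for i in range(len(task_node_feats)):
        for j in range(i + 1, len(task_node_feats)):
            cka_term = cka_term + linear_cka(task_node_feats[i], task_node_feats[j])
    return mtl_loss - lam * cka_term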
Creating Group Nodes:
The overall loss used during the task learning phase leads tasks to learn similar features while also minimizing the concerned task loss. Next, the method of the invention starts the structural learning phase to first leverage neuron creation. In the brain, local neuronal activity can affect the structure of the neural circuitry [29] and play a role in learning experiences [30]. Taking cues from these notions of locality, the computer-implemented method of the invention comprises the step of using CKA to gauge a similarity between task node features that represent the local activity of task neurons. These local task similarities are used to induce the creation of group nodes.
First, CKA between all pairs of task node features is calculated after which all possible groups of task nodes are listed. From these groups, a set of groups that maximizes the total similarity is picked and the groups that satisfy a minimum similarity induce the creation of a group neuron. For instance, in the schematic, we see that the picked groups are [T1, T2] and T3 assuming that the total number of tasks is three.
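The grouping step may be sketched as follows, under two stated assumptions: the candidate sets of groups are enumerated as exhaustive partitions, and a group's similarity is taken as the mean pairwise CKA of its members; both choices, as well as all names in the sketch, are illustrative and not prescribed by the description.

from itertools import combinations

def all_partitions(items):
    # Recursively generate every partition of a list of task indices.
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in all_partitions(rest):
        # Put `first` into an existing group ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new singleton group.
        yield [[first]] + partition

def group_similarity(group, cka):
    # cka: symmetric matrix of pairwise CKA scores between task node features.
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    return sum(cka[i][j] for i, j in pairs) / len(pairs)

def select_groups(task_ids, cka, min_sim):
    # Pick the set of groups (a partition) that maximizes the total similarity.
    best = max(all_partitions(list(task_ids)),
               key=lambda p: sum(group_similarity(g, cka) for g in p))
    # Only groups that satisfy the minimum similarity induce a group neuron.
    return [g for g in best if len(g) > 1 and group_similarity(g, cka) >= min_sim]

For example, with three tasks in which the first two task nodes share a high mutual CKA score and the third is dissimilar to both, the selected partition is [T1, T2] and [T3], and only the group [T1, T2] induces the creation of a group neuron, mirroring the situation described above.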
After grouping the task nodes, a group node is created for each group. The learned knowledge in the task neurons is used to initialize the created group node using a two-step process:
The knowledge amalgamation objective L_KA is provided in the above equation, assuming that there are N tasks grouped together. ATTnet denotes the attention network consisting of two linear layers with an intermediate ReLU activation and a final sigmoid activation.
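Because the equation itself appears only in the original filing, the sketch below shows merely one plausible reading of the amalgamation objective: a per-task channel-attention network, built as stated above (two linear layers with an intermediate ReLU and a final sigmoid), gates a feature-matching term between the group node and each of the N grouped task nodes. The attention input (pooled task-node features) and the mean-squared matching term are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ATTNet(nn.Module):
    # Two linear layers with an intermediate ReLU activation and a final
    # sigmoid activation, as stated in the description.
    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, channels, H, W) -> per-channel attention weights in [0, 1];
        # computing attention from the task-node features is one plausible choice.
        pooled = feats.mean(dim=(2, 3))
        return self.net(pooled)

def knowledge_amalgamation_loss(group_feats, task_feats_list, att_nets):
    # Sum, over the N grouped tasks, of an attention-gated feature-matching term
    # between the group-node features and each task node's features.
    loss = 0.0
    for task_feats, att in zip(task_feats_list, att_nets):
        a = att(task_feats).unsqueeze(-1).unsqueeze(-1)  # (batch, C, 1, 1)
        loss = loss + F.mse_loss(a * group_feats, a * task_feats)
    return loss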
Removing Task Neurons:
Starting from a dense set of neurons the computer-implemented method of the invention provides the opportunity to leverage a rich information flow originating from diverse task information. Using neuron removal, the method of the invention moves towards a sparser architecture by removing task nodes that learn similar representations. These locally similar task nodes become redundant once they transfer their knowledge to the group node. The task branch is then disconnected from these redundant task nodes and connected to the group node. As defined in the problem setup, the neurons from the task branch that now connect to the group node become the task nodes. These changes are evident in the depicted next state in the schematic of
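The rewiring performed by neuron removal may be sketched with hypothetical bookkeeping structures as follows; Node, Branch, and remove_and_rewire are illustrative names and do not appear in the description.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Branch:
    layers: List[object]          # sequence of layers that follow a node

@dataclass
class Node:
    layer: object                 # the layer acting as the node
    parent: Optional["Node"]      # group node, or None when connected to the input
    branch: Branch                # the branch following this node

def remove_and_rewire(redundant_task_nodes: List[Node], group_node: Node) -> List[Node]:
    new_task_nodes = []
    for old in redundant_task_nodes:
        # The first layer of the disconnected task branch becomes the new task
        # node, now connected to the freshly created group node; the redundant
        # task node itself is simply no longer part of the graph.
        new_task_nodes.append(Node(layer=old.branch.layers[0],
                                   parent=group_node,
                                   branch=Branch(layers=old.branch.layers[1:])))
    return new_task_nodes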
Algorithm:
The following pseudo code presents the different phases involved in the computer-implemented method of the current invention, namely a task learning phase, a structural learning phase, and a fine-tuning phase. The task learning phase and the structural learning phase alternate n times, followed by the final fine-tuning phase.
In the task learning phase, the entire network is trained to minimize the multi-task loss and to maximize similarity among task nodes. The structural learning phase involves neuron creation and neuron removal. E_t determines the number of epochs for which each subsequent task learning phase is executed. Similarly, E_s determines the number of epochs used for ATTnet-based knowledge transfer. Considering a total training budget of E epochs, the task learning phase is executed up to E − f epochs, where f is the minimum number of epochs allocated for the fine-tuning phase, during which the task nodes are no longer forced to align. In the fine-tuning phase, the network is trained with the multi-task loss only.
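This phase schedule can be rendered in Python as the sketch below; the phase routines are passed in as callables because their internals are described elsewhere, and only the parameter names E, E_t, E_s, f and n are taken from the text.

def mtsl_schedule(network, E, E_t, E_s, f, n,
                  task_learning_epoch,   # multi-task loss minus lambda times the CKA term
                  create_group_nodes,    # neuron creation based on local task similarity
                  knowledge_transfer,    # one epoch of ATTnet-based knowledge transfer
                  remove_task_neurons,   # rewires task branches to the new group nodes
                  finetune_epoch):       # multi-task loss only, no alignment term
    epoch = 0
    for _ in range(n):
        # Task learning phase: runs for E_t epochs, but never past E - f,
        # so that at least f epochs remain for fine-tuning.
        for _ in range(E_t):
            if epoch >= E - f:
                break
            task_learning_epoch(network)
            epoch += 1
        # Structural learning phase: create group nodes, transfer knowledge
        # for E_s epochs, then remove the redundant task neurons.
        groups = create_group_nodes(network)
        for _ in range(E_s):
            knowledge_transfer(network, groups)
        remove_task_neurons(network, groups)
    # Fine-tuning phase: the remaining budget trains with the multi-task loss only.
    while epoch < E:
        finetune_epoch(network)
        epoch += 1
    return network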
Embodiments of the present invention can include every combination of features that are disclosed herein independently from each other.
Typical application areas of the invention include, but are not limited to:
Although the invention has been discussed in the foregoing with reference to an exemplary embodiment of the method of the invention, the invention is not restricted to this particular embodiment which can be varied in many ways without departing from the invention. The discussed exemplary embodiment shall therefore not be used to construe the appended claims strictly in accordance therewith. On the contrary, the embodiment is merely intended to explain the wording of the appended claims without intent to limit the claims to this exemplary embodiment. The scope of protection of the invention shall therefore be construed in accordance with the appended claims only, wherein a possible ambiguity in the wording of the claims shall be resolved using this exemplary embodiment.
Variations and modifications of the present invention will be obvious to those skilled in the art and it is intended to cover in the appended claims all such modifications and equivalents. The entire disclosures of all references, applications, patents, and publications cited above are hereby incorporated by reference. Unless specifically stated as being “essential” above, none of the various components or the interrelationship thereof are essential to the operation of the invention. Rather, desirable results can be achieved by substituting various components and/or reconfiguration of their relationships with one another.
Optionally, embodiments of the present invention can include a general or specific purpose computer or distributed system programmed with computer software implementing steps described above, which computer software may be in any appropriate computer language, including but not limited to C++, FORTRAN, ALGOL, BASIC, Java, Python, Linux, assembly language, microcode, distributed programming languages, etc. The apparatus may also include a plurality of such computers/distributed systems (e.g., connected over the Internet and/or one or more intranets) in a variety of hardware implementations. For example, data processing can be performed by an appropriately programmed microprocessor, computing cloud, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), or the like, in conjunction with appropriate memory, network, and bus elements. One or more processors and/or microcontrollers can operate via instructions of the computer code and the software is preferably stored on one or more tangible non-transitory memory-storage devices.
Number | Date | Country | Kind |
---|---|---|---|
NL2032650 | Aug 2022 | NL | national |