This application claims priority to and benefit of United Kingdom Application No.: GB2317986.4, filed on Nov. 24, 2023 and titled “Computer-implemented method for training a multi-task neural network.” The contents of the above-identified application are relied upon and incorporated herein by reference in their entirety.
The invention relates to a computer-implemented method for training a multi-task neural network. The invention further relates to a computing and/or controlling device, a computer program, and a computer-readable storage medium.
A deep neural network model which has a suitable architecture and is trained by multi-task learning algorithms can perform multiple tasks at the same time. Many recent works have investigated these two components from different perspectives and designed numerous deep neural network architectures and multi-task training algorithms across many domains.
However, in general, multi-task neural networks are difficult to train. The core difficulty of the multi-task learning problem is that different tasks require capturing different levels of information and exploiting them to varying degrees.
However, due to the black-box nature of deep neural networks, it may be hard to know what features or information the model must extract from the input data for each task and how the model learns such a feature extraction process.
The object of the invention is to provide an improved method for training a multi-task neural network.
The object of the invention is achieved by the subject-matter of the independent claims. Advantageous embodiments of the invention are subject-matter of the dependent claims.
In one aspect, the invention provides a computer-implemented method for training a multi-task neural network for predicting a plurality of T tasks, T≥2, simultaneously based on input data, the method comprising:
An advantage of the method may be that it can be used for training any neural network, e.g., a multilayer perceptron, a convolutional neural network, a deep neural network, for predicting a plurality of T tasks, T≥2, simultaneously.
The training dataset and the validation dataset used to meta-train the neural network may both be split from an original dataset. Thus, in the present case, the term “validation dataset” may be different from the widely-used term in deep learning, which refers to a dataset only used for evaluating the performance or finding the hyperparameters.
A further advantage of the method may be that it can align or balance the training speed of the plurality of T tasks having different levels of a learning complexity. Furthermore, the method may improve training of the multi-task neural network such that an averaged prediction accuracy across several or all T tasks and/or selected prediction accuracies for predicting one or more selected tasks are maximized. Additionally, or alternatively, the method may further improve training of the multi-task neural network such that selected prediction accuracies for predicting one or more selected tasks are similar in scale.
A further advantage of the method may be that it can decrease the size of the trained multi-task neural network.
Preferably, the respective single-task loss function Lt for the corresponding task t=1, . . . , T is validated in the respective training epoch on the corresponding single-task control parameter θt′ predicted in said training epoch and on inputs of the validation dataset D′ related to said task.
Preferably, c1) the optimization neural network Optimϕ includes the multi-task neural network that is trained across the predefined number Nepoch of training epochs such that the combined loss function is minimized within the predefined number Nepoch of training epochs.
An advantage of the method may be that the optimization neural network Optimϕ can be used both as the multi-task neural network itself and as the neural network for predicting the corresponding single-task control parameter θt′. This may reduce the computation resources needed for training.
Preferably, c2) the optimization neural network Optimϕ is parameterized with an optimization parameter ϕ that is adapted in the respective training epoch based on the optimization parameter ϕ of the corresponding previous training epoch and a gradient descent of the combined loss function.
Such a parameterization of the optimization neural network Optimϕ may similarly have the advantage that the optimization neural network Optimϕ can be used both as the multi-task neural network itself and as the neural network for predicting the corresponding single-task control parameter θt′.
Preferably, the optimization neural network Optimϕ is meta-trained in the respective training epoch for predicting the corresponding single-task control parameter θt′ in said training epoch based on inputs of the training dataset related to said task t=1, . . . , T inputted to the optimization neural network Optimϕ.
The training dataset and the validation dataset
′ may represent joint datasets, that are combined according to the different tasks to be learned. For example, the training dataset
and the validation dataset
′ may be joint datasets with multiple labels y0, . . . ,yT for learning an object detection task, a semantic segmentation task, and/or a depth estimation task. Thus, the training dataset
and the validation dataset D′ may be adapted according to the needs and use case.
Preferably, the optimization neural network Optimϕ is trained for generating a multiple-task control parameter θi for the respective training epoch based on the multiple-task control parameter θi-1 for the corresponding previous training epoch and/or inputs of the training dataset related to each of the plurality of T tasks inputted to the optimization neural network Optimϕ.
The multiple-task control parameter θi may be used as a common measure over all T tasks based on which the corresponding single-task control parameter θt′ is predicted. The multiple-task control parameter θi may further be used as a common measure over all T tasks based on which the distance metric function d is evaluated. Thus, the role of the multiple-task control parameter θi may be multi-functional, which may reduce the computation resources needed for training. This may provide an improved method for simultaneous learning of the different tasks.
Preferably, the optimization neural network Optimϕ is meta-trained in the respective training epoch for predicting the corresponding single-task control parameter θt′ in said training epoch based on the multiple-task control parameter θi-1 of the corresponding previous training epoch inputted to the optimization neural network Optimϕ.
Preferably, the respective distance metric value for the respective training epoch is calculated in said training epoch by evaluating the distance metric function d on the multiple-task control parameter θi for said training epoch.
Preferably, the optimization neural network Optimϕ is meta-trained for one or more meta-training steps in the respective training epoch and/or the optimization neural network Optimϕ is meta-trained in said training epoch for predicting the single-task control parameter θt′ for the final training epoch.
An advantage of the method may be that the optimization neural network Optimϕ is trained for predicting the single-task control parameter θt′ for the final training epoch. Thus, in each training epoch, the training of the different tasks is aligned in a forward-looking manner with respect to the final training epoch. This may improve the training of the multi-task neural network.
Preferably, the combined loss function and/or the single-task loss functions Lt are based on one, several, or all of the following: a focal loss, a cross-entropy loss, and a bounding box regression loss.
Preferably, the distance metric function is based on one, several, or all of the following: a Kullback-Leibler (KL) divergence metric, a CKA distance metric, and an L2 distance metric.
Preferably, the regularization function is based on a norm function of the difference, the norm function being preferably an absolute value function.
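By way of a purely illustrative, non-limiting sketch, the distance metric function d and the regularization function could, for example, be realized as follows (PyTorch; the function names and the assumption that parameters are given as flat tensors and features as matrices of shape samples × dimensions are illustrative only and not part of the claimed method):

```python
import torch
import torch.nn.functional as F

def kl_distance(p_logits, q_logits):
    # KL divergence between two output distributions (sketch)
    return F.kl_div(p_logits.log_softmax(dim=-1), q_logits.softmax(dim=-1),
                    reduction="batchmean")

def cka_distance(feat_a, feat_b):
    # 1 - linear CKA similarity between two feature matrices of shape (samples, dims)
    feat_a = feat_a - feat_a.mean(dim=0, keepdim=True)
    feat_b = feat_b - feat_b.mean(dim=0, keepdim=True)
    hsic = torch.norm(feat_b.t() @ feat_a, p="fro") ** 2
    return 1.0 - hsic / (torch.norm(feat_a.t() @ feat_a, p="fro")
                         * torch.norm(feat_b.t() @ feat_b, p="fro"))

def l2_distance(theta_a, theta_b):
    # L2 distance between two flattened parameter vectors
    return torch.norm(theta_a - theta_b, p=2)

def pairwise_regularizer(distance_values):
    # |d_t1 - d_t2| summed over all pairs of tasks (absolute-value norm of the difference)
    return sum(torch.abs(d1 - d2)
               for i, d1 in enumerate(distance_values)
               for d2 in distance_values[i + 1:])
```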
Preferably, the method further comprises:
The controlling device may be implemented in any kind of apparatus, e.g., in an autonomous vehicle. The input data may thus relate to sensor data of said apparatus or autonomous vehicle. In an autonomous vehicle, the controlling device may generate control signals for navigating the vehicle. According to the method, the multi-task neural network may thus be trained for predicting one or more navigation tasks.
In another aspect, the invention provides a computing and/or controlling device comprising means adapted to execute the steps of the method according to any of the preceding embodiments.
In another aspect, the invention provides a computer program comprising instructions to cause a computing and/or controlling device to execute the steps of the method according to any of the preceding embodiments.
In another aspect, the invention provides a computer-readable storage medium having stored thereon the computer program.
Embodiments of the invention preferably have the following advantages and effects:
The proposed methods may be individually packaged as a software module to be added to existing deep learning or deep multi-task training pipelines for model understanding and for enhanced joint performance. The technologies involved in preferred embodiments of the invention may also be widely applied to deep learning applications in the wild with distinctive training data distributions, such as multi-class, multi-modal, multi-label, and multi-domain learning.
The primary application of preferred embodiments of the invention is deep multi-task model understanding and training dynamics optimization for perception tasks. Specifically, the technical problem that preferred embodiments of the invention are focused on solving is two-fold:
The key technical innovations that set preferred embodiments of the invention apart from and offer advantages over prior art include:
To realize the above idea, preferred embodiments of the invention provide a metric to define and measure the training stage of each task, and a meta-learning-based optimization algorithm to implement a training process which keeps the training stages of all tasks aligned.
Embodiments of the invention are now explained in more detail with reference to the accompanying drawings of which
The first case 14a from the left indicates prediction accuracies 15 of a single-task neural network for predicting the first task 10 or the second task 12, respectively.
The neural network is trained for either the first task 10 or the second task 12, respectively.
The second case 14b indicates the prediction accuracies of a multi-task neural network for predicting the first task 10 and the second task 12 simultaneously. The multi-task neural network is trained simultaneously for both the first task 10 and the second task 12, according to known methods.
The third case 14c indicates prediction accuracies of the multi-task neural network for predicting the first task 10 and the second task 12 simultaneously. In contrast to the second case 14b, the multi-task neural network is trained successively for both tasks 10, 12, starting with the first task 10.
The fourth case 14d indicates prediction accuracies of the multi-task neural network for predicting the first task 10 and the second task 12 simultaneously. In contrast to the third case 14c, the multi-task neural network is trained starting with the second task 12.
As can be derived from
An idea of preferred embodiments of the invention is to provide an optimized scheme for training a multi-task neural network such that the prediction accuracies for predicting the first task 10 or the second task 12, respectively, are preferably similar in scale. Furthermore, the averaged prediction accuracy for both tasks 10, 12 should be maximized. To achieve this object, a further idea of preferred embodiments of the invention is to align a training speed of different tasks in a multi-task neural network.
Pseudo-code A takes as inputs the training dataset, the initial parameter θ0, and the number of total training epochs Nepoch, and loops over the training epochs as described in the following steps S11 to S13.
In a step S11, a single-task neural network Mθ for predicting a single task and a training dataset for training the neural network Mθ on the task are provided. The single-task neural network Mθ is parameterized with a model parameter θ that is initialized with θ0. The neural network Mθ further includes a plurality of Nlayer network layers.
In a step S12, the neural network Mθ is trained by using the training dataset such that a loss function L is minimized within a predefined number Nepoch of training epochs. In a respective training epoch i=1, . . . , Nepoch, the model parameter θi is adapted or updated based on the model parameter θi-1 of the previous corresponding training epoch and a gradient descent of the loss function L. In this way, for each training epoch i=1, . . . , Nepoch, the neural network Mθ with the adapted model parameter θi is obtained.
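A minimal, non-limiting sketch of steps S11 and S12, assuming a standard PyTorch model, data loader, and stochastic gradient descent, and assuming that one checkpoint of the model parameter is kept per training epoch, may look as follows:

```python
import copy
import torch

def train_single_task(model, loss_fn, train_loader, n_epochs, lr=0.1):
    # Steps S11/S12 (sketch): train M_theta by gradient descent on L and keep
    # one checkpoint of the model parameter theta_i per training epoch.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    checkpoints = []
    for epoch in range(n_epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)   # loss function L
            loss.backward()               # gradient of L w.r.t. theta
            optimizer.step()              # theta_i <- theta_{i-1} - lr * grad
        checkpoints.append(copy.deepcopy(model.state_dict()))
    return checkpoints                    # checkpoints[i] ~ model parameter after epoch i+1
```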
In a step S13, for each training epoch i=1, . . . ,Nepoch and for each network layer j=1, . . . ,Nlayer, the loss function L of a joint neural network is recorded.
The joint neural network is merged from a front model Mfront and a back model Mback. Therefore, the neural network Mθ is clipped at a clipping position Nclipping ∈ [1, . . . , Nlayer −1].
The front model Mfront for a respective training epoch i=1, . . . , Nepoch and for a respective network layer j=1, . . . , Nlayer then represents the neural network Mθ for said training epoch from the first network layer until the clipping position Nclipping = j.
The back model Mback for a respective network layer j=1, . . . , Nlayer represents the neural network Mθ for the final training epoch from the network layer after the clipping position Nclipping until the last network layer.
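A minimal, non-limiting sketch of step S13, assuming that the neural network Mθ is built as an nn.Sequential of Nlayer layers so that it can be clipped at a layer index (the helper names make_model and record_joint_losses are illustrative only), may look as follows:

```python
import torch
from torch import nn

def record_joint_losses(make_model, checkpoints, loss_fn, loader):
    # Step S13 (sketch): record the loss of the joint network made of
    # M_front (checkpoint of epoch i, layers 1..j) and M_back (final model, layers j+1..end).
    final_state = checkpoints[-1]           # parameters of the final training epoch
    n_layers = len(make_model())            # assumes make_model() returns an nn.Sequential
    losses = {}
    for i, state_i in enumerate(checkpoints):
        for j in range(1, n_layers):        # clipping position N_clipping = j
            front = make_model(); front.load_state_dict(state_i)
            back = make_model();  back.load_state_dict(final_state)
            joint = nn.Sequential(*front[:j], *back[j:])
            joint.eval()
            with torch.no_grad():
                losses[(i, j)] = sum(loss_fn(joint(x), y).item() for x, y in loader)
    return losses
```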
The joint neural networks have each been trained with a differently modified CIFAR100 training dataset. The CIFAR100 training dataset has been modified by adding different levels of noise thereto, which simulates different levels of a learning complexity of the task.
The first line of the diagram of
The second, third, and fourth line of the diagram of
As can be derived from
Pseudo-code B takes as inputs, inter alia, the training dataset and the validation dataset D′, and contains nested loops over the training epochs and the meta-training steps. In the respective training epoch i, the combined loss is accumulated as
combined loss ← Σt=1T Lt(θt′, D′) + Σt1=1T Σt2=1T |d(θi−1, θt1′) − d(θi−1, θt2′)|,
as described in the following.
In a step S21, the first embodiment includes:
The training dataset D and the validation dataset D′ represent jointly labelled datasets, where each data item (e.g., a picture) has multiple labels y0, . . . , yT corresponding to the labels of all T tasks. The multiple labels y0, . . . , yT may be used for training different tasks (e.g., object detection, semantic segmentation, depth estimation, etc.).
In a step S22, the first embodiment includes:
The combined loss function for a respective training epoch depends on a sum of a plurality of single-task loss values for the respective training epoch and respectively from a single-task loss function Lt for a corresponding task t=1, . . . , T.
The respective single-task loss function Lt for the corresponding task t=1, . . . , T is validated on a corresponding single-task control parameter θt′ predicted in said training epoch and inputs of the validation dataset D′ related to said task. The corresponding single-task control parameter θt′ is predicted in the respective training epoch by an optimization neural network Optimϕ that is meta-trained in said training epoch in relation to the corresponding task t=1, . . . , T.
The optimization neural network Optimϕ may be meta-trained for one or more meta-training steps in the respective training epoch. The number of meta-training steps in the respective training epoch may be variable and/or depending on the number of training epochs that are remaining from said training epoch until the Nepoch-th training epoch is reached. In other words, the optimization neural network Optimϕ may be meta-trained for predicting the single-task control parameter θt′ for the final training epoch.
The combined loss function for the respective training epoch further depends on a regularization value for the respective training epoch and from a regularization function for all pairs of tasks t1, t2=1, . . . , T.
The regularization function for a respective pair of tasks t1, t2=1, . . . , T is based on a difference between a corresponding pair of distance metric values respectively from a distance metric function d.
Each respective distance metric value for the respective training epoch is calculated in said training epoch by evaluating the distance metric function d on the corresponding single-task control parameter θt′ predicted in said training epoch.
The combined loss function and/or the single-task loss functions Lt may be based on one, several, or all of the following: a focal loss, a cross-entropy loss, and a bounding box regression loss.
The distance metric function d may be based on one, several, or all of the following: the Kullback-Leibler (KL) divergence metric 16, the CKA distance metric 18, and the L2 distance metric 20.
The regularization function may be based on a norm function of the difference, the norm function being preferably an absolute value function.
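By way of a purely illustrative, non-limiting sketch, the combined loss function of the first embodiment could be assembled as follows (the callables task_loss_fns and distance_metric, as well as the representation of the control parameters as tensors, are assumptions of this sketch and not part of the claimed method):

```python
import itertools
import torch

def combined_loss(theta_multi, single_task_params, task_loss_fns, val_batches,
                  distance_metric):
    # Sketch: sum of per-task validation losses L_t plus the pairwise
    # alignment regularizer on the distance metric values.
    task_losses = [loss_fn(theta_t, batch)                     # L_t(theta'_t, D')
                   for loss_fn, theta_t, batch
                   in zip(task_loss_fns, single_task_params, val_batches)]
    total = torch.stack(task_losses).sum()

    # distance metric value per task, evaluated against the multiple-task parameter
    dists = [distance_metric(theta_multi, theta_t) for theta_t in single_task_params]

    # regularization: absolute difference of distance values for all pairs of tasks
    reg = sum(torch.abs(d1 - d2) for d1, d2 in itertools.combinations(dists, 2))
    return total + reg
```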
Reference is now made to the pseudo-code B:
In other words, the optimization neural network Optimϕ is parameterized with an optimization parameter ϕ. The optimization parameter ϕ is adapted or updated in the respective training epoch based on the optimization parameter ϕ of the corresponding previous training epoch and a gradient descent of the combined loss function.
Furthermore, the optimization neural network Optimϕ is meta-trained in the respective training epoch for predicting the corresponding single-task control parameter θt′ in said training epoch based on inputs of the training dataset related to said task t=1, . . . , T inputted to the optimization neural network Optimϕ.
Furthermore, the optimization neural network Optimϕ is trained for generating a multiple-task control parameter θi, i=1, . . . , Nepoch, for the respective training epoch based on the multiple-task control parameter θi-1 for the corresponding previous training epoch and/or inputs of the training dataset related to each of the plurality of T tasks inputted to the optimization neural network Optimϕ.
Furthermore, the optimization neural network Optimϕ is meta-trained in the respective training epoch for predicting the corresponding single-task control parameter θt′ in said training epoch based on the multiple-task control parameter θi-1 of the corresponding previous training epoch inputted to the optimization neural network Optimϕ.
Furthermore, the respective distance metric value for the respective training epoch is calculated in said training epoch by evaluating the distance metric function d on the multiple-task control parameter θi for said training epoch.
Furthermore, the regularization function is based on the absolute value function.
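Building on the combined_loss sketch above, one training epoch of the meta-training described in this embodiment could, purely by way of illustration, be organized as follows (optim_net stands for the optimization neural network Optimϕ; its calling convention and the passed-in meta_optimizer are assumptions of this sketch):

```python
def meta_train_epoch(optim_net, meta_optimizer, theta_multi, task_train_batches,
                     task_loss_fns, val_batches, distance_metric, meta_steps=3):
    # Sketch of one training epoch of the meta-training loop.
    # Predict the single-task control parameter theta'_t for each task t.
    single_task_params = []
    for t, batch in enumerate(task_train_batches):
        theta_t = theta_multi
        for _ in range(meta_steps):                      # a few meta-training steps
            theta_t = optim_net(theta_t, batch, task=t)  # Optim_phi predicts the next theta'_t
        single_task_params.append(theta_t)

    # combined loss: per-task validation losses plus pairwise alignment regularizer
    loss = combined_loss(theta_multi, single_task_params, task_loss_fns,
                         val_batches, distance_metric)

    meta_optimizer.zero_grad()
    loss.backward()          # gradient descent of the combined loss w.r.t. phi
    meta_optimizer.step()
    return loss.detach()
```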
In a step S31, the second embodiment includes:
In a step S32, the second embodiment includes:
In a step S33, the second embodiment includes:
Preferred embodiments of the invention may be summarized as follows:
Preferably, we use the term front-model to refer to the first half of the model separated at a layer, and back-model to refer to the second half. The front-model may be a mapping function from the input space to the hidden space, and the back-model may be a function from the hidden space into the label space. An idea is that if a front-model has acquired a sufficiently strong feature extraction ability, there may be very little performance gap compared with the final trained model when both use the back-model from the final model. It should be emphasized that this concept may primarily apply to the backward perspective, where the ultimate model is the outcome of training derived from the evaluated model. In instances where other parameter initializations or alternative training algorithms are employed, the final back-model may require an equivalent representation of the feature, and a significant performance drop may result.
We provide our algorithm in pseudo-code A and results are shown in
Preferably, the noisy CIFAR100 datasets are constructed from the original CIFAR100 dataset, but we preferably replace some correct labels in the training data with completely random noise.
As shown in
In the case of multi-tasking, it may be hard to directly transfer the back-model from the single-task training to compare the training stages of all tasks, because the output of the back-model may be fundamentally different for different tasks. From the perspective of the data, the output may be in a different label space. From the perspective of the model, the tasks may use separate task heads. Here, we preferably suggest using the distance between features generated by the shared backbone. The back-model is regarded as the second half of the shared backbone, not including the task-specific head. Extra training on each task may be needed, which starts from the evaluated multi-task model in every training step and continues until sufficiently good single-task models are obtained, which are then used as back-models. Then we can follow the above process to obtain the descriptions of all tasks for a multi-task learning model.
Preferred embodiments of the invention provide a meta-learning approach to keep an aligned training stage during the multi-task training process. We first preferably use the difference between the tasks' training stages to design a penalty term, e.g., the absolute value of the difference in distances for all layers and all tasks. However, since the calculation of the training stage may depend on the training algorithm, and our goal is to find a better multi-task training algorithm, this may be a cyclic dependence problem.
We preferably use a meta-learning perspective, that is, we preferably use a meta-learning process to learn a neural network optimization process so that the training stage is aligned throughout the process. We preferably model this optimization process as a deep neural network Optimϕ, and then we may train this optimization process with two major objectives: 1) the optimization process should minimize the original loss values for all tasks within a standard training time, and 2) the penalty term should be small. According to these two requirements, we provide our meta-learning algorithm in pseudo-code B.
In the actual implementation, we preferably use the neural network as a loss function and combine it with a standard gradient descent algorithm to form the whole optimization process, i.e., Optimϕ(θ, x, y) = θ − α∇θLϕ(fθ(x), y). And we preferably do not actually train the single-task network until optimal (obtaining the best θt′), but instead use the model after multi-step stochastic gradient descent to greatly reduce computation.
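A minimal, non-limiting sketch of this implementation choice, assuming a functional model f(θ, x) and a small learned loss network Lϕ (both names are illustrative and not part of the claimed method), may look as follows:

```python
import torch
from torch import nn

class LearnedLossOptimizer(nn.Module):
    # Optim_phi(theta, x, y) = theta - alpha * grad_theta L_phi(f_theta(x), y)   (sketch)
    def __init__(self, loss_net: nn.Module, alpha: float = 0.01):
        super().__init__()
        self.loss_net = loss_net   # learned loss L_phi, e.g. a small network over (prediction, label)
        self.alpha = alpha

    def forward(self, theta, f, x, y):
        loss = self.loss_net(f(theta, x), y)        # learned loss on the prediction f_theta(x)
        # one gradient step on theta; create_graph=True keeps the graph so that
        # gradients can later flow back into the parameters phi of loss_net
        grad = torch.autograd.grad(loss, theta, create_graph=True)[0]
        return theta - self.alpha * grad

def approximate_single_task_param(optim_phi, theta, f, batches, n_steps=5):
    # multi-step gradient descent approximation of theta'_t instead of training to optimality
    for x, y in batches[:n_steps]:
        theta = optim_phi(theta, f, x, y)
    return theta
```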
The invention also provides a computing and/or controlling device comprising means adapted to execute the steps of the described method for training a multi-task neural network for predicting a plurality of T tasks, T≥2, simultaneously based on input data. The invention further provides a computer program comprising instructions to cause the computing and/or controlling device to execute the steps of the described method. The invention further provides a computer-readable storage medium having stored thereon the computer program.
| Number | Date | Country | Kind |
|---|---|---|---|
| GB2317986.4 | Nov 2023 | GB | national |