The present invention relates to multi-task networks.
Deep multi-task learning methods are a subset of deep learning methods that aim at exploiting similarities between several tasks in order to improve individual task performance.
Deep multi-task learning methods have recently shown excellent results in diverse application fields, including face analysis, image segmentation, speech synthesis, and natural language processing [3].
The most widely adopted method generally relies on implicit modelling of task dependencies using weight sharing. It consists in splitting the model into parts that are shared across tasks and parts that are task specific.
Seminal work [7] adopted this weight sharing strategy by making use of a common encoder along with task specific regressors. The intuition behind this is that forcing the same learned features to predict several related tasks should encourage more general representations and consequently improve generalization performances.
However, one weakness of this method comes with deciding how far features should be shared as it intuitively depends on task relatedness (the more related the tasks the further the sharing) which can be hard to determine. To tackle this issue, several approaches such as [8] used adaptive architectures to jointly learn which layers should be shared between tasks, as well as the task prediction itself. This philosophy has also been used in modular approaches which consist in learning a set of trainable modules along with how they should be combined for each task: [7] used soft layer ordering to learn the best block combination for each task in a fully differentiable way.
In this work, the method uses a set of D modules and learns to combine them in different orders for each task. It results in T different representations (one for each task). Each representation is then fed to different dense layers to obtain T predictions (one per task) in parallel. More specifically, it is explained in section 7 of this article that the soft layer ordering method can be seen as a new type of recurrent network in the sense that it reuses the same D modules at different depths of the architecture. More formally, the representation propagation defined in eq. 7 of D1 can be rewritten as $y_i^k = G(y_i^{k-1}, s_i^k)$. As such, the current representation recursively depends on previous representations.
Weight sharing may help find features that are useful for all tasks and therefore implicitly models input-related conditional task dependencies. However, it does not capture inter-task relationships that do not depend on the input (e.g., the prior that detection of a beard implies a high probability that the subject also has a mustache). In order to model those dependencies, several approaches such as [1] leveraged recurrent neural networks to decompose the joint task distribution into a product of conditional distributions using the Bayes chain rule. [12] argued that the order in which the chain rule is unrolled impacts the modelling performance of the final joint estimate. In light of this observation, they proposed a two-step method. The first step is an exploration phase in which several orders are explored. At the end of this phase, a single order is fixed once and for all based on the exploration phase performance, and predictions are computed using this order.
The most widely adopted deep multi-task approach is to model task dependencies using weight sharing only [3], i.e., to assume conditional independence of the labels given the input image. Template networks for this approach are composed of a shared encoder $f_W$ parametrized by matrix $W$ along with a specific prediction head $g_{W^{(t)}}$ for each task t.
In the following, scalars are denoted using regular characters, vectors are in bold. For vector v the t-th coordinate is denoted by v(t), and the vector of its t−1 first coordinates is denoted by v(<t).
Let $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ be a training dataset composed of input vectors x and binary labels y of size T, such that ∀t∈[1, T], y(t)∈{0,1}, where T is the number of tasks.
One refers to instances of such template as Vanilla Multi-task Networks (VMN), as shown in
Given input x, the prediction for task t is the output of the t-th prediction head:
Although VMNs are the most classic multi-task learning approach, their modelling assumption does not account for inter-task relationships that are independent of the input image. Those relationships include numerous human knowledge-based priors, such as the statistical dependency between the presence of a beard and the presence of a mustache. The predictive performance of a deep network may naturally benefit from exploiting these priors.
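By way of non-limiting illustration, the vanilla template can be sketched in plain Python as one shared encoder feeding T independent heads. The layer sizes, the random weights, the ReLU and sigmoid activations and the helper names (`dense`, `random_layer`, `vmn_predict`) are assumptions made for this example only and are not taken from the description.

```python
import math
import random

def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def random_layer(n_out, n_in, rng):
    """Hypothetical initialisation: small uniform weights, zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def vmn_predict(x, encoder, heads):
    """Shared encoder followed by one head per task; heads never see each other's output."""
    features = [max(0.0, a) for a in dense(x, *encoder)]  # ReLU features shared by all tasks
    return [sigmoid(dense(features, *head)[0]) for head in heads]

rng = random.Random(0)
T = 3
encoder = random_layer(8, 2, rng)                    # 2 inputs -> 8 shared features
heads = [random_layer(1, 8, rng) for _ in range(T)]  # one single-output head per task
p = vmn_predict([0.2, -0.7], encoder, heads)         # T probabilities, one per task
```

Because each head only receives the shared features, no head can exploit another task's label, which is the conditional independence assumption discussed above.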
Different from VMN, Multi-task Recurrent Neural Networks (MRNN) model inter-task dependencies through both weight sharing and joint conditional distribution modelization, as shown in
For that purpose, the joint conditional distribution of labels is decomposed using Bayes chain rule:
The MRNN approach is to encode input vectors with network fW and to feed the output representation as the initial state h(0) of a recurrent computation process driven by cell gV with parameters V. At step t, this process takes one hot encoded ground truth for timestep t−1 task along with hidden state h(t-1) and outputs prediction p(t) for timestep t and next timestep hidden state h(t). In a nutshell:
Task t conditional distribution w.r.t previous tasks is estimated as:
The performances of MRNN are directly linked to the conditional joint distribution estimate modelization performance that itself depends on the modelization performance of each element in the chain rule product. In [12], it is established that order matters, meaning that unrolling chain rule in a different task chaining order leads to different modelization performance. This comes from the fact that tasks may be easier to learn in a given order. By relying on a single arbitrary chain rule decomposition order, MRNN misses the opportunity to better exploit the inter-task relationships.
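As a hedged illustration of the chain rule itself, the sketch below uses a small, made-up joint distribution over T = 3 binary labels for one fixed input. For the exact distribution, the product of conditionals gives the same value whichever order is used; the order only starts to matter once each conditional is replaced by a learned estimate. The weights and the helper names (`marginal`, `chain_rule_product`) are hypothetical.

```python
from itertools import permutations, product

def marginal(joint, fixed):
    """Total probability of all label vectors that agree with `fixed` (task index -> value)."""
    return sum(p for y, p in joint.items() if all(y[t] == v for t, v in fixed.items()))

def chain_rule_product(joint, y, order):
    """Multiply the conditionals p(y[task] | labels of earlier tasks) along `order`."""
    prob, seen = 1.0, {}
    for task in order:
        denom = marginal(joint, seen)   # probability of the labels seen so far
        seen[task] = y[task]
        prob *= marginal(joint, seen) / denom
    return prob

# Made-up joint distribution over the 8 possible label vectors.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {y: w / total for y, w in zip(product((0, 1), repeat=3), weights)}

y = (1, 0, 1)
values = [chain_rule_product(joint, y, order) for order in permutations(range(3))]
```

All six entries of `values` coincide with `joint[y]` up to floating-point error.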
There remains a need for improving performances of multi-task networks, and exemplary embodiments of the present invention relate to a computer-implemented method for training a multi-task network comprising at least one recurrent network having:
Thanks to the invention, the network can smoothly select the best order and significantly outperforms existing multi-task approaches, as well as state-of-the-art methods for real-world multi-task problems. The network achieves superior performances by jointly optimizing task ordering and predictions.
By optimizing the task order, the method according to the invention takes advantage of situations where order matters. A further advantage of the invention is that, by learning more than one task order all along, the training improves the network's generalization capacity, leading to competitive performance even in situations where orders do not significantly matter.
Contrary to what is described in [7], the invention uses T different modules, one for each task, and learns to sequentially predict each task based on the previously predicted tasks and the current extracted representation. To do so, the network jointly learns the tasks and the order in which they should be predicted. Therefore, contrary to [7], the prediction of each task benefits from the information of all previously predicted tasks. In addition, the invention does not share modules between tasks but uses the same module for each task across different orders. For a given task and a given order, the prediction is computed using a specific task module based on the recursively propagated hidden state that embeds the information of all previously predicted tasks, as shown in a section below by the computational graph used in the present invention.
By “task specific cell”, it is meant that each cell is trained to predict the same task across all orders.
The tasks may be of various types and the method according to the invention is not limited to a particular type of task, or a particular category of tasks.
In some embodiments, the tasks may be selected among face attribute classification tasks, for example attributes related to the prediction of gender, such as mustache, beard, heavy makeup and sex, or attributes related to the detection of accessories, for instance earrings, eyeglasses, necklace. . . . The attributes may be features representing face attractiveness, such as arched eyebrows or high cheekbones, or may be related to haircut.
In other embodiments, the tasks may be related to facial action unit detection or any other multi-task binary classification problem.
The tasks may also be selected among other families, such as categorical classification, regression or even unsupervised problems.
By “differentiable order selector”, it is meant that the network may learn the best order in a differentiable way. The method according to the invention extends MRNN by estimating the joint conditional distribution in parallel using different orders and smoothly selecting the best estimate.
The order selector may comprise a softmax layer over logits u.
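By way of example only, a softmax over logits u can be sketched as follows. The function name `softmax` and the sample logits are illustrative; subtracting the maximum logit is a common numerical-stability step and is an implementation choice, not something stated in the description.

```python
import math

def softmax(logits):
    """Map real-valued logits to non-negative coefficients that sum to one."""
    top = max(logits)                            # shift for numerical stability
    exps = [math.exp(u - top) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

pi_uniform = softmax([0.0, 0.0, 0.0])   # equal logits: every order gets the same weight
pi_skewed = softmax([1.0, 2.0, 3.0])    # a larger logit gives a larger coefficient
```

Since the output is always a valid set of convex weights and the map is differentiable in u, the coefficients can be trained by gradient descent together with the rest of the network.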
The order selector may determine the combination of the different possible orders in different ways.
Preferentially, the order selector determines the convex combination of different possible orders based on a soft order modelling inside Birkhoff's polytope.
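Such a modelling is described below. Let us define a soft order of T tasks as any real doubly stochastic matrix Ω of size T×T, i.e., a matrix with non-negative coefficients such that:
$$\forall i \in [1, T],\ \sum_{j=1}^{T} \Omega_{i,j} = 1 \quad \text{and} \quad \forall j \in [1, T],\ \sum_{i=1}^{T} \Omega_{i,j} = 1$$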
Intuitively, in such case, the coefficient Ωi,j associated to each row i and column j in Ω corresponds to the probability to address task j at step i. Therefore, in the extreme situation where all columns are one-hot vectors, a soft order matrix becomes a “hard” order (i.e., a permutation matrix) that models a deterministic task order. More precisely, if σ denotes a permutation, its associated order matrix is:
The Birkhoff-von Neumann theorem states that the class of doubly stochastic matrices (also called Birkhoff's polytope) is the convex hull of all the order matrices.
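In other words, any soft order matrix Ω can be decomposed as a convex combination of M order matrices. Formally, there exist a finite number M, coefficients $(\pi_1, \pi_2, \ldots, \pi_M) \in \mathbb{R}^M$ that are non-negative and sum to one, and order matrices $M_{\sigma_1}, \ldots, M_{\sigma_M}$ such that:
$$\Omega = \sum_{m=1}^{M} \pi_m M_{\sigma_m}$$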
Therefore, each soft order of T tasks can be parametrized by the coefficients (π1, π2, . . . , πM) associated to each possible order matrix, with M=T!. The converse is also true: given M order matrices, with M≤T!, each convex combination (π1, π2, . . . , πM) also defines a soft order.
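The converse direction can be checked numerically. The sketch below builds order matrices for three permutations of T = 3 tasks, mixes them with convex coefficients, and verifies that the result is doubly stochastic. It assumes, consistently with the row/column reading given above, that the entry at row i and column j of an order matrix is 1 when task j is addressed at step i; indices are 0-based and the coefficient values are made up.

```python
def order_matrix(sigma):
    """Permutation matrix with entry [i][j] = 1 when step i addresses task j (assumed convention)."""
    T = len(sigma)
    return [[1.0 if sigma[i] == j else 0.0 for j in range(T)] for i in range(T)]

def soft_order(orders, pi):
    """Convex combination of the order matrices with weights pi."""
    T = len(orders[0])
    omega = [[0.0] * T for _ in range(T)]
    for weight, sigma in zip(pi, orders):
        mat = order_matrix(sigma)
        for i in range(T):
            for j in range(T):
                omega[i][j] += weight * mat[i][j]
    return omega

def is_doubly_stochastic(mat, tol=1e-9):
    T = len(mat)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in mat)
    cols_ok = all(abs(sum(mat[i][j] for i in range(T)) - 1.0) < tol for j in range(T))
    nonneg = all(v >= -tol for row in mat for v in row)
    return rows_ok and cols_ok and nonneg

orders = [(0, 1, 2), (0, 2, 1), (1, 0, 2)]   # three of the 3! = 6 possible orders
pi = [0.5, 0.3, 0.2]                         # illustrative convex coefficients
omega = soft_order(orders, pi)
```

Here the first row of `omega` reads as "task 0 is addressed first with probability 0.8 and task 1 with probability 0.2".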
The method according to the invention is a multi-task learning method with joint task order optimization, which means that the task order and the prediction are modeled in an end-to-end manner.
The exemplary soft order modelling described above can be used to provide a differentiable parametrization of soft orders, which allows both the task order and the prediction to be learned jointly by smoothly navigating within Birkhoff's polytope. To do so, M≤T! random permutations are first generated, denoted as $(\sigma_m)_{m \in [1, M]}$. For each permutation σ, a joint distribution estimate $p_\sigma$ is generated by unrolling the chain rule in order σ:
Finally, a final joint distribution is computed as a convex combination of each permutation-based joint distribution:
By learning the order selector coefficients (π1, π2, . . . , πM), the search for the soft order Ω is constrained to a convex subset of Birkhoff's polytope, namely the convex hull of the order matrices $M_{\sigma_1}, \ldots, M_{\sigma_M}$.
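The two steps described above (one chain-rule estimate per permutation, then a convex combination) can be sketched as follows. The function `cond_prob` is a hypothetical stand-in for a learned conditional estimator, and its formula, the orders and the coefficients are invented for illustration.

```python
from itertools import product

def cond_prob(task, prev_labels, x):
    """Hypothetical estimator of P(label of `task` = 1 | earlier labels, x)."""
    raw = 0.3 + 0.1 * task + 0.2 * sum(prev_labels) + 0.1 * x
    return min(0.95, max(0.05, raw))

def joint_for_order(y, x, sigma):
    """Unroll the chain rule in order sigma for the label vector y."""
    prob, prev = 1.0, []
    for task in sigma:
        p_one = cond_prob(task, prev, x)
        prob *= p_one if y[task] == 1 else 1.0 - p_one
        prev.append(y[task])
    return prob

def mixture_joint(y, x, orders, pi):
    """Convex combination of the per-order joint estimates."""
    return sum(w * joint_for_order(y, x, sigma) for w, sigma in zip(pi, orders))

orders = [(0, 1, 2), (0, 2, 1), (1, 0, 2)]
pi = [0.5, 0.3, 0.2]
total = sum(mixture_joint(y, 0.0, orders, pi) for y in product((0, 1), repeat=3))
```

Each per-order estimate sums to one over all label vectors, so the mixture does too; the per-order values themselves generally differ because the estimator sees different contexts in different orders.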
The formulation described above may be used in other ways. For example, by keeping the coefficients frozen during the training phase, one could learn all orders and output the prediction as the unweighted mean of all order predictions.
The training may comprise joint training of a shared encoder with the training of the order selector and task-specific cells of the recurrent network.
The task-specific cells are preferentially recurrent cells, which allows the order information to be captured. Such recurrent cells may be of different types, in particular of the GRU type or of the LSTM type.
The multi-task network may be composed of T task-specific recurrent cells $(g_{W^{(t)}})_{1 \le t \le T}$.
For an order σ, task σ(t) is predicted using recurrent cell $g_{W^{(\sigma(t))}}$.
The rationale behind this comes from traditional RNN usage, where each cell predicts a single task in different contexts (e.g., RNN-based sentence translation is a repetition of word translations conditioned on the context of neighboring words). Here, each task-associated predictor learns to predict the corresponding task in different contexts corresponding to the different orders.
Concretely, for order σm, the computational graph unfolds as follows:
The prediction pm(t) at timestep t is computed as:
Hence, parameters $\theta = (W, (W^{(t)})_{1 \le t \le T})$ as well as the order selector, defined as a softmax layer over logits u, are jointly learned by minimizing a loss function. Preferentially, the training is performed to minimize the maximum likelihood-based loss function defined as:
The above-described architecture is preferential when logits u do not depend on the input vector x, that is, if there exists an order, or several orders, that perform best for all input vectors x.
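The loss formula is not reproduced in this text. As a hedged sketch only, one plausible maximum-likelihood form is the negative log of the order-weighted mixture likelihood, summed over training examples, with the weights given by a softmax over input-independent logits u. This form, the function names and the sample numbers are assumptions and may differ from the loss actually used.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(u - top) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_nll(logits, per_order_likelihoods):
    """Assumed loss: -sum_i log(sum_m pi_m * likelihood[i][m]).

    per_order_likelihoods[i][m] is the likelihood of example i's labels
    under the joint estimate unrolled in order m.
    """
    pi = softmax(logits)
    return -sum(
        math.log(sum(w * p for w, p in zip(pi, row)))
        for row in per_order_likelihoods
    )

likelihoods = [[0.9, 0.1]]                         # order 0 explains this example much better
loss_uniform = mixture_nll([0.0, 0.0], likelihoods)
loss_skewed = mixture_nll([2.0, 0.0], likelihoods)  # more weight on order 0
```

Under this assumed form, shifting weight toward the order with the higher likelihood lowers the loss, which is the behaviour the order selector is expected to learn.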
Other architectures are possible, in particular to account for the case where the best order(s) depends on the input. For instance, logits u may be predicted from input x using a fully dense network with rectified linear unit (ReLU) activations.
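As a non-limiting illustration of input-dependent logits, the sketch below passes x through a small dense network with ReLU activations and applies a softmax to the M outputs. The layer sizes, the random initialisation and the names (`order_coefficients`, `random_layer`) are hypothetical.

```python
import math
import random

def dense(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(u - top) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_out, n_in, rng):
    """Hypothetical initialisation: small uniform weights, zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def order_coefficients(x, layers):
    """Dense layers with ReLU on the hidden layers; the last layer outputs the logits u(x)."""
    h = x
    for weights, bias in layers[:-1]:
        h = [max(0.0, a) for a in dense(h, weights, bias)]
    u = dense(h, *layers[-1])
    return softmax(u)

rng = random.Random(0)
M = 4                                             # number of candidate orders
layers = [random_layer(16, 2, rng), random_layer(M, 16, rng)]
pi_a = order_coefficients([0.9, 0.1], layers)
pi_b = order_coefficients([-0.9, 0.4], layers)
```

Each input receives its own set of convex coefficients, so different inputs may favour different orders.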
The method according to the invention is not limited to a task order selection mechanism, and the corresponding architecture that has just been described.
In other embodiments, for instance, the task space may be partitioned in several blocks of conditionally independent tasks and the differentiable order selector may determine a soft order of blocks during training. Such a partitioning into blocks greatly improves the scalability of the method by allowing the network to handle a large number of tasks.
The smooth selection performed by the order selector according to the invention contrasts with [12]'s once and for all choice of order by keeping on learning several orders during the training phase.
In some embodiments, an order dropout mechanism may be added on top of the soft order modelling. Such a mechanism encourages modules to display good predictive performance for several orders of prediction and consequently bolsters the modules' generalization capacity.
A further advantage is that it may prevent order overfitting as well as save computational runtime.
Hence, preferentially, the order selector performs an order dropout during at least part of the training.
The above-described order selection mechanism during training is based on the evolution of the order selector coming from the minimization of the loss function using gradient descent techniques. Intuitively, this tends to assign higher order selection coefficient to the orders that already have the lowest losses.
This observation comes with issues at different stages of the training:
When the training begins, the order losses mostly depend on network initialization, rather than on their respective performance.
Hence, order selection in the first epochs is likely to lead to quasi-random solutions. This may be all the more problematic as weight allocation is prone to snowballing, i.e., it tends to keep allocating more and more weight to a previously selected order.
To circumvent this issue, the method may comprise freezing the order selector during a warm-up phase so that all M task orders are given an identical weight.
Therefore, during the warm-up phase, the order logits u may be frozen for the first n epochs to give the network an exploration opportunity.
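A minimal sketch of such a warm-up, assuming a plain gradient step on the logits, is given below. The function name `update_order_logits`, the gradients and the learning rate are made up for illustration; an actual implementation would typically rely on a framework optimizer.

```python
def update_order_logits(logits, grads, epoch, warmup_epochs, lr):
    """Gradient step on the logits u, skipped while epoch < warmup_epochs."""
    if epoch < warmup_epochs:
        return list(logits)            # frozen: equal logits keep equal order weights
    return [u - lr * g for u, g in zip(logits, grads)]

u = [0.0, 0.0, 0.0]                    # equal logits give every order the same weight
grads = [0.4, -0.1, -0.3]              # made-up gradients of the loss w.r.t. u
history = []
for epoch in range(4):
    u = update_order_logits(u, grads, epoch, warmup_epochs=2, lr=0.5)
    history.append(u)
```

In this toy run the logits stay at zero for the first two epochs and only start moving from the third epoch onward.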
The order with the lowest training loss best fits the empirical distribution. Thus, ranking orders based on their training loss may yield increased risk of overfitting.
To address this issue, the drop-out strategy may comprise training each example on a random subset of k permutations by zeroing-out (M−k) order selector coefficients, the order coefficient being defined as:
For inference, each exp(um) is then multiplied by its probability p(k, M) of presence, as in [28]:
The order dropout may thus prevent order overfitting by encouraging the network to keep accurate predictions for different orders.
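The formulas for the dropout coefficients are not reproduced in this text. The following sketch is one possible reading: at training time, a random subset of k orders is kept and the softmax is taken over the kept logits only; at inference, every exp(um) is scaled by its presence probability, assumed here to be k/M for a uniformly drawn subset, and the result is normalised. The function names and the normalisation step are assumptions.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(u - top) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dropout_coefficients(logits, k, rng):
    """Training time: keep k random orders, zero out the other M - k coefficients."""
    M = len(logits)
    kept = set(rng.sample(range(M), k))
    top = max(logits[m] for m in kept)
    exps = [math.exp(logits[m] - top) if m in kept else 0.0 for m in range(M)]
    total = sum(exps)
    return [e / total for e in exps]

def inference_coefficients(logits, k):
    """Inference: scale each exp(u_m) by the assumed presence probability k / M, then normalise."""
    presence = k / len(logits)
    top = max(logits)
    scaled = [math.exp(u - top) * presence for u in logits]
    total = sum(scaled)
    return [s / total for s in scaled]

u = [0.5, -1.0, 0.0, 2.0, 1.0]
train_pi = dropout_coefficients(u, k=3, rng=random.Random(0))
test_pi = inference_coefficients(u, k=3)
```

Under these assumptions the presence probability is the same for every order, so it cancels in the normalisation and the inference coefficients coincide with the plain softmax over all M logits.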
Once trained, the multi-task network may generate a prediction from an input x. The prediction for an input x is the expectation of the order-based network predictions over all orders.
Several variants are possible, corresponding to several ways of estimating this expectation.
In particular, an aspect of the present invention relates to a computer-implemented method for performing multiple prediction tasks and generating a global network prediction using a multi-task network trained according to the method described above, the method comprising:
Sampling from the order selector allows the T recurrent cells, one for each task, to be dynamically combined in an order that has been learned during training.
Inference may thus be done by predicting all tasks in L orders sampled from the order selector and by averaging those L predictions.
The sampling may be obtained through different methods. Preferentially, the sampling is done using Monte Carlo sampling estimation.
For example, the Monte Carlo sampling estimation samples N orders and computes the average of the prediction of the N corresponding order-based networks. The main benefit of this approach is that it restricts the number of parallel predictions to compute from M to N.
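A minimal sketch of the Monte Carlo estimate is given below. It assumes that the per-order predictions are already available and expressed in a common task order; the prediction values, the coefficients and the function name `monte_carlo_prediction` are invented for the example. In practice only the sampled orders would need to be evaluated.

```python
import random

def monte_carlo_prediction(predictions, pi, n_samples, rng):
    """Average the predictions of n_samples orders drawn with probabilities pi.

    predictions[m][t] is the probability predicted for task t by the network
    unrolled in order m.
    """
    sampled = rng.choices(range(len(pi)), weights=pi, k=n_samples)
    T = len(predictions[0])
    return [sum(predictions[m][t] for m in sampled) / n_samples for t in range(T)]

predictions = [[0.9, 0.2], [0.7, 0.4], [0.1, 0.8]]   # M = 3 orders, T = 2 tasks
pi = [0.6, 0.3, 0.1]                                 # order selector coefficients
estimate = monte_carlo_prediction(predictions, pi, n_samples=20, rng=random.Random(0))
degenerate = monte_carlo_prediction(predictions, [1.0, 0.0, 0.0], 5, random.Random(1))
```

When one coefficient equals one, as in `degenerate`, every sample is the same order and the estimate reduces to that order's prediction.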
The expectation can be estimated using other methods.
For instance, one may obtain the exact estimation by computing the full weighted average of the prediction of all order networks.
As a variant, the prediction could be obtained using a filtering-based estimation which would consist in restricting the weighted average of order-based network prediction to the orders with high order selector values.
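The exact and filtering-based estimates can be contrasted on a small example. In the sketch below, the threshold value and the renormalisation of the kept coefficients are assumptions, since the description does not specify how "high" selector values are chosen; the function names and numbers are illustrative.

```python
def exact_prediction(predictions, pi):
    """Full weighted average over all M order-based predictions."""
    T = len(predictions[0])
    return [sum(w * p[t] for w, p in zip(pi, predictions)) for t in range(T)]

def filtered_prediction(predictions, pi, threshold):
    """Weighted average restricted to orders whose coefficient is at least `threshold`."""
    kept = [(w, p) for w, p in zip(pi, predictions) if w >= threshold]
    total = sum(w for w, _ in kept)          # renormalise over the kept orders (assumption)
    T = len(predictions[0])
    return [sum(w * p[t] for w, p in kept) / total for t in range(T)]

predictions = [[0.9, 0.2], [0.7, 0.4], [0.1, 0.8]]   # M = 3 orders, T = 2 tasks
pi = [0.6, 0.3, 0.1]
exact = exact_prediction(predictions, pi)
filtered = filtered_prediction(predictions, pi, threshold=0.2)
```

With these numbers the third order is filtered out, so only two of the three order-based networks would need to be evaluated.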
The method according to the invention may be used for different applications. For instance, the method may be used for performing at least one of human face and body analysis, scene analysis, speech recognition, and image classification.
The method may use images as inputs. The prediction tasks may be of various nature, depending on the application.
For instance, for human face analysis, the method may be used to predict landmarks, pose, action units, attributes, emotions, identities . . . .
For human body analysis, the predictions may comprise body pose, attributes, actions . . . .
For scene analysis, the method may be used to generate segmentation maps, 3D depth maps, object bounding box . . . .
For speech recognition, the method may be used to predict the identity or emotions, or for source separation.
For image classification, the prediction tasks may consist in classifying the images into classes and subclasses, such as differentiating dogs and poodles . . . .
The method according to the invention may be implemented using any computer system.
The computer system may be a laptop, a personal computer, a workstation, a computer terminal, one or several network computers, or any other data processing system or user device, such as a microcontroller for example.
Preferentially, the computer system comprises at least one processing unit which controls the overall operation related to the method of the invention by executing computer program instructions. The processing unit may comprise general or special purpose microprocessors or both, or any other kind of central processing unit.
The computer system may comprise one or more storage media on which computer program instructions and/or data are stored. The storage media may comprise magnetic, magneto-optical disks, or optical disks, such as an internal hard drive, any type of removable storage medium or a combination of both.
The computer system preferably comprises at least one memory unit that can be used to load the instructions and/or data when they are to be executed by the processing unit.
The memory unit comprises for instance a random-access memory (RAM) and/or a read only memory (ROM).
The computer system may further comprise one or more user interfaces, such as any type of display, a keyboard, a pointing device such as a mouse or trackpad, audio devices such as speakers or microphones, or any other type of device that allows user interaction with the computer system.
The computer system may comprise one or more network interfaces for communicating with other devices.
Another aspect of the invention relates to a computer program product comprising code instructions that cause a computer system to perform the method as defined above when the program is run on the computer system.
Several exemplary embodiments will now be described with reference to the accompanying drawings, in which:
In the following description, use will be made of the notations already disclosed in reference to
More generally, as described earlier, a multi-order network in accordance with the invention may be composed of T recurrent cells 10.
Each recurrent cell is “task-specific”, that is, it is trained to predict the same task across all orders. Here, each task-associated predictor learns to predict the corresponding task in different contexts corresponding to the different orders.
In the disclosed example, there are T=3 tasks (task 1, task 2, task 3) and M=3 exemplary orders σ are shown (σ1=[1, 2, 3], σ2=[1, 3, 2] and σ3=[2, 1, 3]), though more orders are of course possible.
The multi-order network 1 comprises a shared encoder fW and an order selector 2, and the training of the network occurs with joint task order optimization.
The training relies on a differentiable order selection based on soft order modelling inside Birkhoff's polytope. Recurrent modules, one for each task, are learned with their optimal chaining order in a joint manner.
In the disclosed example, the soft order selection consists in determining the order selector coefficients π1, π2 and π3, corresponding to the first, second and third order considered, respectively.
At the inference level, to output a final prediction P comprising the three tasks, the predictions p1, p2 and p3 of the recurrent task-specific cells are weighted by a convex combination using the order selector coefficients π1, π2 and π3 determined during training.
Before weighting, the predictions may be normalized in a common order (represented by the round arrows on
On
The raw input is an image I, which is fed to an encoder fW. The tasks may be related to detection and/or prediction of some attributes of the person represented on the image. For instance, the network may detect the mustache or beard of the person and predict the gender.
At inference time, the network predicts all tasks in L different orders sampled from the order selector 2.
The final prediction P is then the average of those L predictions $p_i$. This soft order modelling allows both the order selection and the task prediction to be learned jointly at train time. Thus, it can efficiently capture inter-task dependencies, leading to enhanced multi-task performance.
In another embodiment represented on
For each block order, an order selector coefficient $\pi_i$ is determined, and the predictions from each block are weighted using these coefficients to generate a final prediction for each task.
Several examples are now described in reference to
The first example, hereafter referred to as the “Toy experiment”, is an empirical validation of the method according to the invention with a 2-dimensional multi-task binary classification toy dataset.
The second example, hereafter referred to as “real-scenarios”, compares the performance of the method according to the invention with existing multi-task baselines in several real-world scenarios such as attribute detection and Facial Action Unit (FAU) detection.
A 2-dimensional multi-task binary classification toy dataset represented in
For T tasks, it uses the following laws for input and labels:
Where
The square [−1, 1]² is vertically split in a recursive way. This dataset has a natural order in which the tasks are easier to solve.
More precisely, task 1 vertically splits [−1, 1]² into two zones of equal size. Examples in the left zone are positive for task 1, while the right zone holds negative examples.
Task 2 applies the same vertical splitting strategy to each of the two zones so formed.
More generally, task t vertically splits each of the $2^{t-1}$ zones of task t−1.
All tasks are deterministic and consequently mutually independent conditionally on X.
The best order is the order in which each task is easier to learn based on the previous one's results. The key point of this claim is that the result of a task can simplify another task without any apparent statistical dependencies between the two. If the toy example is taken for T=2, Y2(1) can be expressed as a function of X:
Therefore, the result of task 1 simplifies the learning of task 2's dependency on X from the indicator function of a union of two segments to the indicator function of a single segment. Similarly, the knowledge of the first t−1 tasks simplifies the learning of task t's dependency on X from the indicator function of a union of $2^{t-1}$ segments to the indicator function of a single segment.
Hence, for t∈[1, T], conditioning task t+1 on the results of the first t tasks transforms it into an easy-to-solve classification problem with a single linear boundary. Conversely, conditioning task t+1 on the results of upcoming tasks in coordinate order does not reduce its complexity as a classification problem: it remains a 2t−1 linear boundary problem.
500, 250 and 250 examples are generated for the train, val and test partitions, respectively. Those sizes are deliberately small to challenge the networks' modelling performance.
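Since the generating laws are not reproduced in this text, the sketch below is only one reading of the verbal description: inputs are assumed uniform on [−1, 1]², the split is assumed to act on the first coordinate, and within each pair of sub-zones the left one is taken as positive. The function names and the random seed are hypothetical.

```python
import random

def toy_labels(x1, num_tasks):
    """Labels of tasks 1..num_tasks for first coordinate x1 in [-1, 1]."""
    u = (x1 + 1.0) / 2.0                       # rescale to [0, 1]
    labels = []
    for t in range(1, num_tasks + 1):
        zones = 2 ** t                         # task t yields 2**t vertical zones
        zone = min(int(u * zones), zones - 1)  # zone index, counted from the left
        labels.append(1 if zone % 2 == 0 else 0)
    return labels

def make_split(n, num_tasks, rng):
    """Draw n examples with inputs assumed uniform on the square."""
    data = []
    for _ in range(n):
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        data.append((x, toy_labels(x[0], num_tasks)))
    return data

rng = random.Random(0)
train, val, test = (make_split(n, 3, rng) for n in (500, 250, 250))
```

Knowing the first t−1 labels pins down which zone of task t−1 the example lies in, so task t reduces to a single threshold on the first coordinate inside that zone.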
In particular,
Two particular datasets are used for the real-world scenarios.
CelebA is a widely used database in multi-task learning, composed of ~200k celebrity images annotated with 40 different facial attributes. For performance evaluation, the accuracy score is measured using the classic train (~160k images), valid (~20k images) and test (~20k images) partitions for 5 different subsets: gender, accessories, haircut, beauty, miscellaneous.
Each subset comprises several different attributes as described below:
Those attributes display prior statistical dependencies. For example, a beard often comes with a mustache and heavy makeup is likely to include lipstick.
The specificity of this subset is that the attributes are mutually exclusive.
DISFA is a dataset for facial action unit detection which is a multi-task problem. It contains 27 videos for 100k face images. Those images are collected from 27 participants and annotated with 12 unitary muscular activations called Action Units (AU). Originally, each AU label is an intensity score from 0 to 5. In detection, labels with an intensity score higher than 2 are considered positive [13]. For performance evaluation, one may follow related work strategy that is to report F1-Score on 8 AUs using a subject exclusive 3-fold cross-validation with publicly available fold partition from and [11].
In the experiments, the multi-order network in accordance with the invention is compared with several multi-task baselines, each using a shared encoder and a number of prediction heads.
For VMN-Common (VMNC), the prediction head consists of two dense→BN→ReLU applications followed by another dense→sigmoid layer with T outputs. VMN-Separate (VMNS) uses T prediction heads, each with the same structure, except that the last layer is of size 1. Finally, MRNN uses a single Gated Recurrent Unit (GRU) cell which sequentially predicts the T tasks. Task order is randomly sampled for each MRNN experiment.
The shared encoder consists of four dense layers with 64 units and ReLU activation. Prediction heads for both VMNC and VMNS consist in dense layers with 64 units.
Both MRNN and the multi-order network of the invention employ GRU cells with 64 units and L=20 orders.
All networks are trained for 500 epochs with Adam as described in [4], with batch size 64 and an exponentially decaying base learning rate of 5e−4 with β=0.99.
The order selector of the invention is trained with Adam, with 15 epochs for warm-up and a constant learning rate of 0.005.
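For illustration, an exponentially decaying schedule with the values quoted above can be sketched as follows. Applying the decay once per epoch is an assumption, as the description does not state the decay step; the function name is hypothetical.

```python
def decayed_learning_rate(base_lr, beta, epoch):
    """Exponential decay: the rate is multiplied by beta at every epoch (assumed granularity)."""
    return base_lr * beta ** epoch

network_schedule = [decayed_learning_rate(5e-4, 0.99, e) for e in range(500)]
selector_schedule = [0.005] * 500   # the order selector keeps a constant rate
```

Under this assumption the network learning rate shrinks by 1% per epoch while the order selector rate stays fixed throughout training.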
The shared encoder consists in an Inception-Resnet-v1 backbone with bottleneck size 512 pretrained on VGGFace2 database for face recognition [2]. Prediction heads for both VMNC and VMNS consist in dense layers with 64 units and L=20 trajectories.
Networks are trained for 30 epochs using AdamW [6] with default weight decay and a learning rate of 0.0005 with exponential decay (β=0.96).
For the hyperparameters of the multi-order network according to the invention, M=5!=120 permutations are used with drop-out k=32 for attribute detection and M=512; k=128 for facial action units. In both cases, 5 epochs for warmup are used.
Table 1 below draws a comparison between different versions of the multi-order network according to the invention, corresponding to different settings for hyperparameters M (number of orders) and k (number of kept orders with the proposed order dropout strategy), and with T=4, 5, 6, 7 tasks.
First of all, for T=4, 5, there is a large gap in performance between the Multi-order network according to the invention with random order k/M=1/1 and learned order (k/M=24/24 and k/M=120/120, for T=4 and 5 tasks, respectively).
This is likely due to the fact that, in the latter case, for both T=4 and T=5 tasks, the multi-order network according to the invention is able to successfully retrieve the correct order (corresponding to IT). In such a case, the network can correctly model the sequence of inter-task dependencies and learn adequate representation and prediction functions, hence superior performances.
Therefore, for nearly every run, the multi-order network of the invention manages to find the correct order IT and significantly outperforms the random order, approaching the oracle predictor with correct order.
Furthermore,
As such, it can be seen in Table 1 that the multi-order network of the invention with order dropout (16/24 and 80/120 for T=4 and 5 tasks respectively) matches the oracle performance for T=4, 5, 6 and significantly outperforms the random order baseline with larger number of tasks, e.g., T=7. As such, even for fairly high numbers of tasks (T=6, 7), the multi-order network of the invention using order dropout still manages to retrieve the correct order.
Even when it doesn't manage to do so (⅕ when T=7) the order it finds still outperforms the random order selection baseline. The drop of performances between the oracle and the learned order version can be explained by the exploration time that the multi-order network of the invention takes to find the correct order: intuitively, the smaller the k/M ratio is, the more the exploration takes over the exploitation at train time. Further investigations are presented in the supplementary section.
Finally, Table 2 shows the relative performance of the multi-order network of the invention w.r.t. multi-task baselines: the multi-order network of the invention displays significantly better performance than VMNC and VMNS as well as MRNN, rivaling the oracle performance. Ultimately, it is demonstrated in a controlled benchmark where an optimal task chaining order is known that (a) the multi-order network of the invention is able to consistently retrieve said order, and (b) thanks to its joint order selection mechanism and task-specific recurrent cell sharing architecture, backed by the proposed order dropout strategy, the multi-order network of the invention is able to consistently outperform other multi-task baselines, getting closer to an oracle predictor using the optimal order.
Real-world applications with potentially more complex inter-task dependencies are now considered.
Table 3 below is a comparison of the multi-order network of the invention with multi-task baselines on several attributes' subsets of CelebA.
On the one hand, there is no clear winner between the two VMN versions: for instance, VMNS performs better on the gender and accessories subsets while VMNC performs better on haircut and misc.
Those performance discrepancies may result in practical difficulties to find an all-around, well performing architecture, as echoed in [8].
Furthermore, MRNN is consistently outperformed by at least one of the VMN methods. In fact, sharing the MRNN recurrent cell across tasks is believed to lead to early conflicts between task-associated gradients and to prevent it from taking full advantage of its theoretically better inter-task relationship modelling.
The multi-order network of the invention, on the other hand, shows consistently better performance than both VMN versions as well as MRNN on every subset, due to its order selection mechanism that, in turn, allows inter-task dependencies to be correctly modelled.
First, it appears that those two matrices are very similar.
Hence, the order selection mechanism of the invention is relatively stable across several networks and order selector initializations.
Second, from a qualitative point of view, it appears that the multi-order network of the invention first focuses on easier, specific tasks: it typically detects the attribute beard (which intuitively is more visible) before mustache, and lipstick (which has a very characteristic color) earlier than heavy makeup (which exhibits more variability).
Then, in light of the prediction of the aforementioned attributes, the multi-order network of the invention concludes on the sex of the subject.
The multi-order network of the invention thus selects a suitable order for predicting the sequence of tasks, enhancing the final prediction accuracy.
Table 4 below shows a comparison between the multi-order network of the invention and other multi-task approaches for facial action unit detection on DISFA database.
There is a large gap in performance between VMNC and VMNS. Furthermore, MRNN performance lies significantly lower than both VMN, likely due to the larger number of tasks, that makes it less likely to find a suitable order.
Nevertheless, due to its order selection mechanism and the proposed order dropout strategy, the multi-order network of the invention reaches significantly higher accuracies.
Table 4 also shows a comparison between the method of the invention and current state-of-the-art deep approaches on DISFA.
Those performances are all the more interesting as methods such as EAC-NET [5], LPNET [9] or JAANET [11] combine appearance features with additional geometric information based on facial landmarks, whereas the method of the invention does not.
Thus, the method of the invention outperforms existing approaches due to its ability to jointly model task order and prediction.
The above-described examples show that with the method of the invention, it is possible to retrieve the correct order on a toy dataset. Furthermore, the method of the invention significantly outperforms existing multi-task approaches as well as state-of-the-art methods for real-world multi-task problems, for instance attribute detection and facial action unit detection.
| Number | Date | Country | Kind |
|---|---|---|---|
| 21306638.4 | Nov 2021 | FR | national |
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/EP2022/082378 | 11/18/2022 | WO |