This application claims priority to foreign French patent application No. FR 2312204, filed on Nov. 9, 2023, the disclosure of which is incorporated by reference in its entirety.
The invention relates to the field of machine-learning methods and concerns a new method for training a model for predicting multimedia data and a method for detecting anomalies using such a model for predicting these multimedia data. The data in question are, for example, images, sequences of images, videos, audio sequences, multispectral images or more generally data that may be multidimensional and multimodal.
Detection of anomalies consists in identifying data that are said to be “abnormal” in a given application context. Data are said to be “abnormal” when they are unusual, unpredictable or undesired. More generally, “abnormal” data may be defined as being data that deviate significantly from “normal” data in a given application context.
The abnormal character of a datum depends on the type of data, on the intended application and on the context. For example, if the data are images of a component or product leaving an industrial manufacturing line, an anomaly corresponds to a visible defect in the component or product.
In the case where the data are video sequences, an anomaly corresponds, for example, to an unusual behaviour of a pedestrian in a given area.
Anomaly detection is particularly useful in the field of video surveillance, where it may be used to identify a person behaving unusually in a public or private place, in the field of autonomous driving, where it may be used to identify an unexpected obstacle on the road, and in the field of industrial quality control, where it may be used to identify a defect in a manufactured product.
In the field of machine learning, the task of detecting a category of data is carried out by a detector optimized at the end of a training (or learning) phase supervised by data representative of the categories to be detected.
Machine-learning methods may therefore be used to develop a model for detecting anomalies in multimedia data.
By definition, abnormal data are inherently rare and diverse compared with normal data, which are abundant. However, to avoid detection biases, it is advisable to train a detector on datasets that are balanced between the various categories.
For this reason, supervised machine-learning methods are unsuitable for dealing with the problem of anomaly detection because the task of annotating abnormal data is extremely expensive and requires a vast range of disparate anomalies to be covered depending on the context and the intended application. Moreover, these approaches are inefficient given the natural imbalance of normal data classes versus abnormal data classes.
A general problem that the invention aims to solve therefore is that of developing an unsupervised machine-learning method for detecting anomalies in multimedia data.
One-class classification machine-learning methods are better suited to the problem of anomaly detection because they use only “normal” data as input and aim to predict or reconstruct these data. The trained model is thus based on extraction of relevant features from normal data and is subsequently used to infer the degree of abnormality of new data to be evaluated.
Unsupervised learning approaches to anomaly detection are based on training an artificial intelligence model (for example a deep neural network) to perform a pretext task on normal data only. In other words, the model is not directly trained to detect anomalies but is trained to perform another task, for example reconstructing normal data, with a view to then using the model to indirectly solve the task of anomaly detection. During the inference phase, an anomaly score may be deduced from the inability of the model to perform the task correctly.
In order for the trained model to be able to effectively characterize the normality of the data and to distinguish it from anomalies, the pretext tasks must meet two necessary conditions. They must be performed correctly on normal data and incorrectly in the presence of anomalies. In other words, the chosen pretext tasks must induce a poor generalization of the model to anomalies.
Methods for detecting anomalies through unsupervised learning may be grouped into essentially two categories: reconstruction-based methods and prediction-based methods.
Approaches based on reconstruction aim to train a model to reconstruct, by way of output, the normal training data received as input. One assumption made by these approaches is that the reconstruction model will not be able to correctly generalize the reconstruction to anomalies, i.e. abnormal data will not be well reconstructed.
Unlike methods based on reconstruction, prediction-based approaches teach models to predict missing information, such as masked parts of normal data, in order to better learn their features.
Methods employing a reconstruction-based approach have the drawback of sometimes also correctly reconstructing abnormal data.
Reference [1] describes a reconstruction-based method for detecting anomalies that involves learning the data distribution using a multi-hypothesis autoencoder. Furthermore, the model is criticized by a discriminator, which prevents the generator from producing unlikely predictions. Autoencoders have the drawback of being capable of reconstructing abnormal features because of their extrapolation capabilities. Thus, they induce a non-negligible rate of non-detection since certain anomalies will be detected as corresponding to normal behaviours.
Methods employing a prediction-based approach generally predict anomalies poorly because training is performed only on normal data and the information to be predicted does not exist in the input data. However, these methods also adapt less well to normal data because the information to be predicted is missing from the input, this potentially leading to prediction difficulties even on normal data.
Most prediction-based methods for detecting anomalies involve learning a single prediction, this having the drawback of not reflecting the diversity of normality. Indeed, a single prediction often does not make it possible to characterize the diversity of a behaviour referred to as normal. For example, consider a simple scenario of a camera observing a vehicle moving along a road and arriving at an intersection with three possibilities: turn right, go straight on or turn left. These three possible future states may all be qualified as “normal”. In such a scenario, a model based on a single predictor will not be able to predict the various future states of the path of the vehicle with a single prediction. On the contrary, the prediction generated will correspond to a mean of the three possible “normal” states. If normal data are not correctly predicted by such a model, then it will also not be possible to detect anomalies through comparison.
One solution to this problem is to design a multiple-prediction model, in order to predict all the “normal” states of a datum.
Reference [2] describes a prediction-based method for detecting anomalies that involves training a model to produce a plurality of different predictions from the same masked datum. Learning a plurality of predictions makes it possible to better cover, in the masked input data, the diversity of behaviours that may be said to be normal.
The authors propose to stochastically predict normal video data using a conditional variational autoencoder. The predictions of the method are made stochastically, this meaning that the samples are not necessarily representative of the learned distribution; in addition, the anomaly score in question does not accurately quantify the degree to which a sample belongs to the distribution of the normal data.
There is therefore a need for a new method for detecting anomalies that overcomes the drawbacks of reconstruction- or prediction-based approaches.
The proposed invention makes it possible to combine the advantages of reconstruction-based methods and prediction-based methods. The invention consists in training a multi-prediction model that does not generalize predictions well in the presence of anomalies, this improving the ability to detect anomalies. Moreover, because a plurality of predictors are used, the proposed model adapts better to normal data than a single-prediction model. The predictions made are deterministic, this ensuring the repeatability of the anomaly scores returned by the system. The predictions are also diversified, this allowing the entire distribution of the normal data to be covered, each predictor specializing in one particular pattern among all the patterns corresponding to a normal feature.
One subject of the invention is a computer-implemented method for training a model for reconstructing multimedia data represented by at least one modality, the model being composed of a set of a plurality of different predictors for each datum modality, the training method comprising the steps of, for each datum of a training dataset containing no anomalies: partially masking the datum by means of a predefined mask; computing, by means of the model, a plurality of different predictions of the original datum from the masked datum; computing a first cost function by selecting, from among the predictions, the prediction closest to the original datum and computing the error between this prediction and the original datum; and updating the parameters of the model by back-propagation so as to minimize the first cost function.
In one particular embodiment, the method further comprises:
According to one particular aspect of the invention, the model comprises:
Another subject of the invention is a computer-implemented method for detecting anomalies in a multimedia dataset having at least one modality, the method comprising the steps of: masking the data by means of the same masking procedure as that used during training; computing, by means of the previously trained model, a plurality of predictions from the masked data; computing the first cost function by selecting the prediction closest to the reference datum; computing an anomaly score from the difference between the computed cost function and a value representative of this cost function computed on the training data; and comparing the anomaly score with a detection threshold in order to deduce therefrom the presence or absence of anomalies.
Other subjects of the invention are a computer program comprising code instructions for implementing the invention and a computer-readable recording medium on which the computer program according to the invention is recorded.
Other features and advantages of the present invention will become more clearly apparent on reading the following description with reference to the following appended drawings.
The model receives as input data masked with a predefined mask M. It comprises a first encoder network E able to convert the masked input data into a latent representation and a plurality of predictor networks D(1), D(2), …, D(n) that are each trained to determine one possible prediction of the original datum from the masked datum.
More generally, the n predictions may be generated by n distinct predictor networks or by a single network able to produce n distinct predictions.
The model of
If the intended application is detection of abnormal behaviour in surveillance videos, the training data contain only images of behaviour that is normal in the context of the monitored area.
Thus, the model of
Use of a plurality of different predictors makes optimal training possible in the sense that the model will not simply learn to generate a prediction corresponding to a mean of all the possible (credible) reconstructions but, in contrast, each predictor will specialize in one possible type of reconstruction.
For example, in the case of an image of a car at a crossroads from which a number of roads may be taken, each predictor will specialize in predicting a path of the car toward one of the roads. In this context, all these predictions correspond to a possible normal behaviour of the car. Conversely, a car driving between two roads (on a pavement or more generally an area not corresponding to a road) corresponds to an abnormal behaviour and therefore to an anomaly in this context.
Generally, the invention applies to any type of multimedia data represented by at least one modality. For example, it applies to video sequences, still images, audio sequences, RGB or multispectral images or a combination of these various media.
Below, the invention is described in the context of data corresponding to video sequences, but it is generalizable to the other types of data mentioned above.
The training is carried out based on training data 201 which do not contain any anomalies, so as to train the model to reconstruct data that may be said to be “normal” in the sense that they contain no anomalies. Thus, the model is trained to reconstruct partially masked “normal” data.
In step 202, the training data are partially masked by means of a predefined mask M. The masking step 202 may take various forms. It consists in masking or altering at least part of at least one datum modality. More precisely, if the input data are represented by a plurality of modalities, each of the modalities may be completely or partially masked provided that at least one modality is to be masked only partially, in order to provide a minimum of information as input to the model. Examples of different modalities will be described below.
For example, in the case of a sequence of images, the masking step 202 may consist in masking one or more areas of each image using a predefined spatial mask M. The mask may be solely spatial or spatio-temporal in which case it also depends on the temporal index of the image in the sequence. The masking step may consist in completely removing an area of an image or in applying noise, for example white noise, to certain areas of an image.
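Purely by way of illustration, and without limiting the invention, the masking step 202 may be sketched in Python as follows; the array layout, the masked region and the noise parameters are arbitrary assumptions made for the example only.

```python
import numpy as np

def mask_frame(frame, region, mode="zero", rng=None):
    """Apply a predefined spatial mask M to one image of the sequence.

    frame  : H x W x C array (one image of the video sequence)
    region : (top, left, height, width) of the area to mask
    mode   : "zero" removes the area entirely, "noise" replaces it with white noise
    """
    rng = rng or np.random.default_rng(0)
    top, left, h, w = region
    masked = frame.copy()
    if mode == "zero":
        masked[top:top + h, left:left + w, :] = 0.0
    elif mode == "noise":
        masked[top:top + h, left:left + w, :] = rng.normal(
            loc=0.5, scale=0.1, size=(h, w, frame.shape[2]))
    return masked

# Example: mask a 64x64 square in the centre of a 128x128 RGB frame.
frame = np.random.rand(128, 128, 3).astype(np.float32)
masked = mask_frame(frame, region=(32, 32, 64, 64), mode="zero")
```

A spatio-temporal mask may be obtained in the same way by making the masked region depend on the temporal index of the image in the sequence.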
The masked data are then provided as input to the model to be trained. In step 203, the various predictions of the original datum are computed from the parameters of the model and the masked input datum. In the chosen embodiment, the predictions aim to predict the masked current image or a future image in the video sequence, for example the image following the masked image in the sequence.
The model therefore provides n predictions Ŷ(1), …, Ŷ(n) and then, in step 204, a first cost function or loss function LNN(Y) is computed. The chosen cost function consists in selecting, from among the n predictions, the one which is closest to a reference Yref corresponding to the original datum and in computing the error between this selected prediction and the reference. For example, if the model aims to predict an image It+1 at the time t+1 from a masked image It at the time t, then the reference is the original image It+1. Alternatively, if the model aims to directly predict the current image It, then the reference is the original image It.
The first cost function is thus given by the following relationship:

LNN(Y) = min_{k ∈ {1, …, n}} d(Ŷ(k), Yref)   (1)

where d is a prediction error, for example the mean squared error, and Y is a representation of the data in a modality. For example, Ŷ(k) is a prediction of an image It+1 and Yref is the original image It+1.
The cost function LNN(Y) may be computed for one or more data modalities, as will be illustrated below.
This first cost function aims to encourage prediction diversity via selection, in each iteration, of the prediction closest to the reference.
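By way of a non-limiting sketch, this winner-take-all selection may be written in Python/PyTorch as follows, with the mean squared error assumed as the prediction error d:

```python
import torch
import torch.nn.functional as F

def loss_nn(predictions, reference):
    """First cost function L_NN: error of the prediction closest to the reference.

    predictions : list of n tensors, each a prediction Y_hat(k) of the reference
    reference   : tensor Y_ref (e.g. the original image I_{t+1})
    Returns the loss of the selected prediction and the index k of the selected predictor.
    """
    errors = torch.stack([F.mse_loss(p, reference) for p in predictions])
    k = int(torch.argmin(errors))      # prediction closest to the reference
    return errors[k], k
```

Only the error of the selected prediction contributes to the gradient, which is what restricts the optimization to the selected predictor (and to the shared encoder) during back-propagation.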
In step 206, a back-propagation algorithm based on a gradient computation is applied so as to update the parameters of the model to minimize the first cost function. At this stage, only the parameters of the predictor of index k selected to compute the cost function are optimized with the parameters of the encoder E during the back-propagation. Unselected predictors are not optimized. In other words, for each modality, only the predictor closest to the reference datum is optimized.
In one variant of embodiment, a second cost function is computed in step 205 via the following relationship:

LNP(Y) = Σ_{k ∈ UT} d(Ŷ(k), Yref)   (2)

where
UT is the set of predictors that have not been selected much or at all to compute the first cost function during a preceding iteration of the training. In order to determine the set UT, a selection threshold may for example be defined, below which a predictor is considered not to have been selected often enough during an iteration.
More precisely, the training is carried out over a plurality of iterations, each iteration over the set of training data being called an epoch. At the end of processing of the previous epoch, predictors that have never been selected to compute the first cost function are noted and the second cost function is computed for these predictors.
It is possible for the predictors selected to compute the second cost function to include the predictor selected to compute the first cost function for a current epoch.
In step 206, the gradient back-propagation algorithm is applied so as to minimize a combination of the two cost functions: L = LNN + λ·LNP, with λ a weighting coefficient. Preferably, the parameter λ is a positive number strictly less than 1, for example equal to 0.1, in order to give more weight to the first cost function LNN. In this way, the predictor whose prediction is closest to the actual datum is optimized (via the first cost function LNN) while also optimizing (via the additional term λ·LNP) predictors that did not participate enough in the training in the preceding epoch. Predictors that have participated enough in the preceding epoch but that are too far from the actual datum are not optimized (specifically, for them the computed gradient is substantially zero with respect to the two cost functions LNN and LNP).
The back-propagation is carried out so as to update the parameters of the predictor selected to compute the first cost function and of the predictors selected to compute the second cost function, and the parameters of the encoder E common to all the predictors.
The objective of optimizing the second cost function LNP is to allow optimization of all the predictors, even those that are never or hardly ever selected, so as to promote prediction diversity.
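As a purely illustrative sketch, the combination of the two cost functions may be implemented as follows; the way the set UT is rebuilt from per-predictor selection counts (the names selection_counts and threshold) is an assumption made for the example only.

```python
import torch
import torch.nn.functional as F

def loss_total(predictions, reference, under_selected, lam=0.1):
    """Combined loss L = L_NN + lambda * L_NP for one datum and one modality.

    predictions    : list of n predictions Y_hat(k)
    reference      : original (unmasked) datum Y_ref
    under_selected : indices U_T of predictors selected too rarely in the previous epoch
    lam            : weighting coefficient lambda < 1, favouring L_NN
    """
    errors = torch.stack([F.mse_loss(p, reference) for p in predictions])
    k = int(torch.argmin(errors))                      # winner-take-all selection
    l_nn = errors[k]
    l_np = errors[list(under_selected)].sum() if under_selected else errors.sum() * 0.0
    return l_nn + lam * l_np, k

# At the end of each epoch, U_T may be rebuilt from per-predictor selection counts,
# for example: under_selected = [k for k, c in enumerate(selection_counts) if c <= threshold]
```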
In one variant of embodiment, the same machine-learning model may be optimized separately to make various predictions relative to various data modalities.
For example, the input data may be organized into the form of a primary modality or basic modality and of one or more additional modalities. In other words, the input data may be represented by a plurality of modalities the importance of which may be varied depending on the application.
The expression “multimodal datum” refers to a set of information that combines a plurality of different data sources or modes. For example, for an audio/video sequence, the sound and the image may be considered to be two modalities of this datum.
For example, for an application for detecting anomalies in the visual appearance and movements of objects present in a video, the input datum is a sequence of images that are partially masked. The basic modality here corresponds to the sequence of images. In this scenario, an additional modality is, for example, an optical flow sequence computed on the sequence of images, or a sequence of classes of objects present in each image (detected by means of an object detector applied to the sequence of images). Optical flow is information characterizing the movement of each pixel of an object between two successive images. It makes it possible to characterize the movement of objects over time in a video sequence. The objects detected in a video sequence may also be classified into object categories. Thus, each image may be accompanied by information about the classes of objects present in that image.
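As an illustration only, such an optical-flow modality may for example be computed with a dense optical-flow estimator; the choice of OpenCV's Farnebäck algorithm and of its parameters below is an assumption made for the example and is not imposed by the invention.

```python
import cv2

def optical_flow_sequence(frames):
    """Compute a dense optical-flow sequence from consecutive grayscale frames.

    frames : list of H x W uint8 images
    Returns a list of H x W x 2 flow fields giving the per-pixel displacement
    between each pair of successive images.
    """
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Arguments: prev, next, initial flow, pyr_scale, levels, winsize,
        # iterations, poly_n, poly_sigma, flags.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return flows
```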
In this example of application, the machine-learning model may also be trained to predict the optical flow based on the masked images received as input, the optical flow being masked entirely, i.e. removed in the sense that it is not provided as input to the model. In this case, a second set of n predictors is optimized using the same cost function and the same learning procedure as the one illustrated in
In the same way, replacing the optical flow with a sequence of classes of objects, a third set of n predictors may be optimized.
The use of intermodal prediction tasks, for example predicting the optical flow from a masked image, also allows detection of abnormal correlations that might not be detected if the modalities were processed independently.
The machine-learning model may thus be optimized overall by means of minimization of a cost function that is a combination of the various cost functions computed for each modality, this combination being, for example, a sum.
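Under the same illustrative assumptions as above, and reusing the loss_total helper sketched previously, the combination of the per-modality cost functions into a single overall cost may be written as follows.

```python
def loss_multimodal(per_modality_outputs, lam=0.1):
    """Overall cost: sum over the modalities of the combined loss L_NN + lambda * L_NP.

    per_modality_outputs : dict mapping a modality name (e.g. "image", "flow",
                           "classes") to a tuple (predictions, reference, under_selected).
    """
    total = 0.0
    winners = {}
    for modality, (preds, ref, u_t) in per_modality_outputs.items():
        l_mod, k = loss_total(preds, ref, u_t, lam=lam)  # combined loss for this modality
        total = total + l_mod                            # overall cost = sum over modalities
        winners[modality] = k                            # index of the selected predictor
    return total, winners
```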
Training predictions in various modalities (spatial, temporal, optical flow, etc.) makes it possible to better characterize the diversity of normal data and to differentiate them from abnormal data.
The principle of training the multi-predictor model for multimodal data may be applied to other modalities.
For example, when the intended application is detection of anomalies in a still image, the input of the model is unimodal and corresponds to an image. The predictions provided by the model are predictions of the current image based on the spatially masked image.
In the case where the images are multispectral, the input of the model is multimodal, each modality corresponding to an image at one wavelength or to a wavelength interval. In this scenario, some of the wavelengths are removed and not input to the model, and the predictors are trained to predict the images at the removed wavelengths based on the images at the other wavelengths. For example, the removed wavelengths correspond to the infrared.
In another example of application, the data are multimedia and multimodal: they comprise both a video sequence and an audio sequence. In this scenario, the audio sequence may be removed and not input into the model, and predictors are trained to predict the audio sequence from the video sequence.
Without departing from the scope of the invention, other multimodal data may be considered. A general objective of the invention is to train the model to reconstruct data that are said to be normal from multiple predictions for one or more data modalities.
When the data are multimodal, it is possible to define a basic modality that corresponds to the modality in which the data are provided as input to the model, partially masked, the additional modalities being, for example, completely masked.
One particular example of embodiment of the invention based on video data and a particular learning model will now be described.
In this example, which is illustrated in
The model also comprises a recurrent neural network R which is trained to produce a succession of states hi recurrently, the initial state h0 being equal to the latent representation output by the projector network P.
The succession of states hi for i varying from 1 to n makes it possible to characterize the input data using n different representations.
Each output state hi of the recurrent network R is provided, with the masked image, to a predictor network fi so as to provide as output n different predictions of the following image YIt+1.
As will be described below, when a plurality of data modalities are considered, a different predictor network is used for each of the modalities. For example, if m modalities are considered, m predictor networks {fT} therefore generate m*n predictions, i.e. n different predictions for each of the m modalities.
The cost function is computed in the manner described above by comparing the predictions generated by the predictor network with the actual image YIt+1 of the sequence.
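A minimal, purely illustrative sketch of this organization is given below; the layer types, dimensions and the number of predictions are assumptions, and, for brevity, the predictor here is conditioned only on the state hi, whereas the described model also provides it with the masked image.

```python
import torch
import torch.nn as nn

class MultiPredictionModel(nn.Module):
    """Sketch: an encoder/projector producing h0, a recurrent block producing the
    states h1..hn, and a predictor applied to each state to obtain n predictions."""

    def __init__(self, latent_dim=128, n_predictions=5):
        super().__init__()
        self.n = n_predictions
        self.encoder = nn.Sequential(                    # E + P: masked image -> h0
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim))
        self.recurrent = nn.Sequential(                  # R: h_{i-1} -> h_i (shared weights)
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim))
        self.predictor = nn.Sequential(                  # f: state h_i -> predicted image
            nn.Linear(latent_dim, 3 * 64 * 64), nn.Sigmoid())

    def forward(self, masked_image):
        h = self.encoder(masked_image)                   # initial state h0
        predictions = []
        for _ in range(self.n):
            h = h + self.recurrent(h)                    # residual recurrent update
            predictions.append(self.predictor(h).view(-1, 3, 64, 64))
        return predictions                               # n different predictions
```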
In one variant of embodiment of the example of
This variant is shown in
In this variant, the model is trained for the three modalities I, F, C so that n*3 different predictions are obtained by the three predictors corresponding to the three data modalities. Each predictor possesses its own specific set of parameters.
Thus, for each datum modality, the model provides a number n of predictions that may be different for each modality. These n predictions may be generated by n distinct predictor networks or by a single network with n branches.
During the phase of gradient back-propagation, the parameters of each predictor network fI, fF, fC are optimized based on the sum of the respective cost functions computed for each modality, i.e. the sum L = LI + LF + LC of the losses computed for the three modalities.
The computed gradients impact the parameters of each predictor network fI, fF, fC according to the respective variations of the cost functions computed for each of the modalities. Thus, the parameters of each predictor network are updated independently.
In the example of
Predictor number 1 is used to compute the second cost function for each of the modalities I,F,C.
This example is given purely by way of illustration, it being understood that the predictors used may be different or identical for each of the modalities and each of the cost functions. They are determined independently for each modality according to criteria relative to the computations of the cost functions described above.
The computed gradients are back-propagated through each of the three predictor networks fI, fF, fC to their input layer, then the results are summed via an adder Σ before being propagated to the assembly consisting of the projector network P and the recurrent network R, which is common to all the modalities.
In other words, the entire model is affected by the sum of the three cost functions computed for each modality, but in practice each respective part of the model (predictors fT, recurrent network R and projector P) is affected differently. The overall cost function is computed as the sum of the cost functions for each modality, each of which is itself a weighted sum of the two cost functions LNN and LNP. This induces, in the various predictors fT, back-propagation gradients of different amplitude depending on the role played by the parameters of these predictors in the computation of the predictions. In other words, each predictor network fT is optimized only by the gradient generated by the cost function computed for the corresponding modality. The networks P and R ultimately receive the sum of the gradients delivered by the three predictor networks.
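With automatic differentiation, this behaviour is obtained simply by summing the per-modality losses before back-propagation, as in the illustrative training step below; the interface of the model (a dict of per-modality prediction lists) and the helper loss_total reused from the earlier sketch are assumptions made for the example.

```python
def training_step(model, optimizer, masked_inputs, references, under_sel, lam=0.1):
    """One hypothetical optimization step over the modalities I, F and C.

    masked_inputs : masked data fed to the model (shared trunk P and R)
    references    : dict of original (unmasked) tensors per modality
    under_sel     : dict of per-modality sets U_T of under-selected predictors
    """
    predictions = model(masked_inputs)          # dict: modality -> list of n predictions
    total = 0.0
    for m in references:                        # e.g. "image", "flow", "classes"
        l_m, _ = loss_total(predictions[m], references[m], under_sel[m], lam=lam)
        total = total + l_m                     # sum of the per-modality cost functions
    optimizer.zero_grad()
    total.backward()   # each predictor f_m only receives the gradient of its own loss;
                       # the shared networks P and R receive the sum of the three gradients
    optimizer.step()
    return float(total)
```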
One advantage of use of a recurrent network R is that it makes it possible to increase the number of predictors without however increasing the size of the model in terms of number of parameters, this reducing requirements in terms of the computing power of the machine executing the algorithm.
In this example, the same predictor network is used for the primary modality and the optical flow and a predictor network of different architecture is used for the object class. This example is non-limiting: it is possible to use the same predictor network for all the modalities with different sets of parameters for each modality or indeed different predictor networks for each modality or even a combination of these two approaches.
In the example of
The recurrent network R is composed of 7 fully connected (FC) layers of identical dimensions, alternated with activation layers implementing a ReLU activation function. The input is added to the output via a residual connection.
The predictor network fI, fF is composed of two stacked autoencoders. The two autoencoders are identical and each composed of an encoder with four convolution layers conv and four ReLU activation layers and then a decoder implementing the same layers in reverse order. The last activation layer of the decoder implements a sigmoid activation function.
The input is added to the output of the first autoencoder via a residual connection. The state hk provided by the recurrent network is concatenated with the outputs of each encoder.
Thus, the last layer of each decoder may be adapted for each of the two modalities I and F, and in particular the dimension of the convolution layer may vary depending on the modality.
The predictor network fC comprises an encoder consisting of four convolution layers conv, three max-pooling layers max pool, five activation layers implementing a ReLU activation function and two fully connected (FC) layers arranged in the order indicated in
The output of the encoder is concatenated with the state hk provided by the recurrent network and is provided as input to a decoder composed of two fully connected (FC) layers, a ReLU activation layer and an activation layer implementing the Softmax activation function.
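For illustration only, the recurrent block R described above may be sketched as follows; whether the last FC layer is followed by an activation, and the dimension of the layers, are assumptions made for the example.

```python
import torch
import torch.nn as nn

class RecurrentBlockR(nn.Module):
    """Sketch of the recurrent block R: fully connected layers alternated with ReLU
    activations, the input being added to the output via a residual connection."""

    def __init__(self, dim=128, n_layers=7):
        super().__init__()
        layers = []
        for i in range(n_layers):
            layers.append(nn.Linear(dim, dim))           # FC layer of identical dimensions
            if i < n_layers - 1:
                layers.append(nn.ReLU())                 # alternated ReLU activation
        self.body = nn.Sequential(*layers)

    def forward(self, h):
        return h + self.body(h)                          # residual connection

# The block is applied recurrently: h_i = R(h_{i-1}) for i = 1..n.
h0 = torch.randn(4, 128)
h1 = RecurrentBlockR()(h0)
```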
The architecture examples given in
Once the machine-learning model has been trained to reconstruct “normal” data via a plurality of credible predictions, it may be used to detect anomalies in a new dataset that may contain anomalies.
To this end, a method for detecting anomalies involving implementing the model described above is proposed in
The method is applied to data 701 of the same nature and of the same modalities as those used for the training, except that they may now contain anomalies.
In step 702, the data are masked by means of the same masking procedure as in step 202 of the training, then the masked data are provided as input to the previously trained model.
In step 703, the model computes a plurality of predictions in the same way as in step 203 of the training.
In step 704 only the first cost function LNN(Y) is computed by means of equation (1), the prediction closest to the reference datum being selected. The cost function is computed taking into account one or more modalities of the available data.
Lastly, in step 705, an anomaly score is computed, which score is based on the difference between the cost function LNN(Y) and a value characteristic of the mean of this cost function computed during training. Specifically, if the input data 701 contain no anomalies, the predictors will correctly reconstruct the masked parts of the data and the computed cost function will be close to the mean of this function computed during training. Conversely, if the input data contain an anomaly in the masked areas, the predictors will not reconstruct this anomaly and the cost function computed in step 704 will have a value significantly different from the value computed during training.
One example of an anomaly-score formula is given in relationship (3):

s(Y) = |LNN(Y) − μtrain|   (3)

where μtrain denotes a value characteristic of the mean of the first cost function LNN computed on the anomaly-free training data.
The anomaly-score formula may be replaced with any other metric allowing computation of the difference between the cost function computed in step 704 and a value representative of the cost function computed on anomaly-free training data.
In step 706, the anomaly score is compared with a detection threshold with a view to deducing therefrom the presence or absence of anomalies in the input data.
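As a final illustrative sketch, and under the same assumptions as the previous examples (the model returns a list of predictions, the helper loss_nn defined earlier is reused, and mu_train and the detection threshold are assumed to have been computed beforehand on anomaly-free data), the detection phase of steps 703 to 706 may be written as follows.

```python
import torch

@torch.no_grad()
def anomaly_score(model, masked_datum, reference, mu_train):
    """Anomaly score of relationship (3): deviation of L_NN from its training-time mean.

    masked_datum : datum masked with the same procedure as during training (step 702)
    reference    : corresponding reference datum Y_ref
    mu_train     : mean value of L_NN observed on the anomaly-free training data
    """
    predictions = model(masked_datum)             # step 703: n predictions
    l_nn, _ = loss_nn(predictions, reference)     # step 704: winner-take-all error
    return abs(float(l_nn) - mu_train)            # step 705: anomaly score

# Step 706: compare with a detection threshold calibrated beforehand, for example
# is_anomalous = anomaly_score(model, masked_x, x_ref, mu_train) > detection_threshold
```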
The invention may be implemented as a computer program comprising instructions for the execution thereof. The computer program may be recorded on a processor-readable recording medium.
Reference to a computer program that, when it is executed, performs any one of the functions described above, is not limited to an application program running on a single host computer. On the contrary, the terms computer program and software are used here in a general sense to refer to any type of computer code (for example application software, firmware, microcode, or any other form of computer instruction) that may be used to program one or more processors to implement aspects of the techniques described here. The computing means or resources may notably be distributed (“Cloud computing”), possibly using peer-to-peer technologies. The software code may be executed on any suitable processor (for example a microprocessor) or processor core or a set of processors, be these provided in a single computing device or distributed among multiple computing devices (for example as possibly accessible in the environment of the device). The executable code of each program allowing the programmable device to implement the processes according to the invention may be stored for example in the hard drive or in read-only memory. Generally speaking, the one or more programs will be able to be loaded into one of the storage means of the device before being executed. The central processing unit is able to command and direct the execution of the instructions or segments of software code of the one or more programs according to the invention, which instructions are stored in the hard drive or in the read-only memory or else in the other abovementioned storage elements.
The invention may be implemented on a computing device based for example on an embedded processor. The processor may be a generic processor, a specific processor, an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA). The computing device may use one or more dedicated electronic circuits or a general-purpose circuit. The technique of the invention may be implemented on a reprogrammable computing machine (a processor or a microcontroller for example) executing a program comprising a sequence of instructions, or on a dedicated computing machine (for example a set of logic gates such as an FPGA or an ASIC, or any other hardware module).
The invention makes it possible to improve both reconstruction- and prediction-based learning approaches by using a multi-prediction model trained to reconstruct masked normal data.
Moreover, using a plurality of predictors trained to perform various pretext tasks for various multimodal-data modalities allows the diversity of the “normal” character of the training data to be better captured. Specifically, training predictions in various modalities (spatial, temporal, optical flow, etc.) makes it possible to better characterize and discriminate between the various credible contents of normal data.